Focus

Prisma AIRS

AI Red Teaming Overview

Table of Contents

Create a Deployment Profile for Prisma AIRS AI Red Teaming

AI Red Teaming Overview

Learn about Prisma AIRS AI Red Teaming.

Where Can I Use This?	What Do I Need?
Prisma AIRS	Red Teaming Deployment Profile

Prisma AIRS supports automated AI Red Teaming. It scans any AI system (LLM or application powered by LLM) for safety and security vulnerabilities. There are a few key components of a red teaming exercise using this tool:

Target - endpoint of the AI system scanned for vulnerabilities is called a target.
Scan - one complete assessment of an AI system using AI Red Teaming is considered as a scan. A scan is carried out by sending attack payloads to an AI system in the form of attack prompts. Three modes of scanning an AI system are provided: Attack Library, Agent and Custom attacks.
Report - the findings of any given AI Red Teaming scan are presented in the form of a Scan Report.

Key Concepts

There are a few concepts to consider when using AI Red Teaming; targets and scans.

Target

A Target is the system or endpoint you want to perform red teaming on using AI Red Teaming. It serves as the focal point for testing and evaluating the security and resilience of your application or model. A target in AI Red Teaming can be:

Models. First party or third party models with a defined endpoint for simulation.
Applications. AI powered systems designed for specific tasks or objectives.
Agents. Specific application subtype where AI models are in charge of the control flow.

AI Red Teaming is designed to work seamlessly with REST APIs and Streaming APIs.

This flexibility allows you to test a wide range of targets, ensuring comprehensive red teaming capabilities for your LLM and LLM-powered applications.

Scan

A Scan represents a complete assessment of an AI system. During a scan, AI Red Teaming evaluates the system's security and robustness by sending carefully crafted attack payloads, also known as attack prompts, to the Target.

AI Red Teaming provides three distinct modes for scanning AI systems:

Attack library scan. This scan uses a curated and regularly updated list of predefined attack scenarios. These attacks are designed based on known vulnerabilities and best practices in red teaming.
Agent scan. This scan utilizes dynamic attack generation powered by an LLM attacker. This mode allows for real-time generation of attack payloads, making it highly adaptive to the specific behavior and responses of the Target.
Custom attack scan. This scan allows you to upload and run your own prompt sets against target LLM endpoints alongside AI Red Teaming's built-in attack library.

By combining these modes, AI Red Teaming ensures a thorough and effective assessment of your AI system's defenses.

How Prisma AI Red Teaming Works

Prisma AI Red Teaming interacts with your applications and models, referred to as Targets, in much the same way as an end user would. This interaction enables AI Red Teaming to simulate realistic scenarios and identify vulnerabilities or weaknesses in your application's or model's behavior. By mimicking end-user actions, it ensures that its findings are relevant and applicable to real-world use cases.

Single-Tenant Deployment for Security and Privacy

One of the core features of AI Red Teaming is its single-tenant deployment model. This approach ensures that:

Complete Isolation: All compute resources and data associated with a customer are completely isolated from other customers.
Enhanced Security: No customer data ever mixes with another, eliminating the risk of cross-tenant data leaks or unauthorized access.
Private Environment: Each customer operates in their own secure environment, providing peace of mind for sensitive applications or models that handle proprietary or confidential information.

Attack Library-based Red Teaming

In this mode, AI Red Teaming uses a proprietary attack library which is constantly updated to simulate attacks on any AI system. Key aspects of an attack library scan include:

Attack categories
Attack severities
Risk Score

Attack Categories

Each Attack Category contains a range of techniques. A prompt can incorporate techniques from multiple categories to enhance its Attack Success Rate (ASR) and is classified into categories based on the techniques it uses. The attack library currently has 3 categories of attacks which also undergo regular updates; Security, Safety, and Compliance:

Security. Represents security vulnerabilities and potential exploits. This category includes the following:
- Adversarial Suffix Prompts. These prompts are created by appending an adversarial suffix—a series of tokens added to the end of a user query that leads to unintended behavior from the model. The adversarial suffixes are learned during the model's training phase, which, when used with a harmful user query, cause the model to deviate from its intended purpose and generate harmful and undesirable outputs.
- Evasion. These prompts use various obfuscation techniques, such as Base64, Leetspeak, ROT13, and other types of ciphers, allowing harmful content to be disguised in a way that prevents the model from identifying and flagging it as harmful or inappropriate. This manipulation enables prompts to bypass the model's safety filters, thereby allowing the extraction of harmful or malicious data.
- Indirect Prompt Injection. These prompts operate by embedding malicious instructions within external data sources such as web pages or documents that are subsequently referenced in text prompts. When the model processes the external data source, its behavior becomes compromised by the concealed instructions, leading to unintended responses and potentially harmful content generation. This technique is particularly effective as safety guardrails typically don't scan external content for malicious instructions.
- Jailbreak. These prompts are designed to make the model disobey its instructions and generate harmful content. They utilize a range of techniques, including role-playing or constructing imaginary environments where it might appear permissible for the model to produce such content, thereby leading to potential violations of ethical guidelines.
- Multi-turn. These prompts exploit the conversational nature of AI systems by gradually manipulating the model across multiple interactions. Rather than attempting to bypass safety measures in a single prompt, attack prompts build context over several exchanges, subtly steering the conversation toward harmful outputs. This technique leverages the model's ability to store context of past messages, allowing seemingly benign individual messages to collectively produce unintended behavior.
- Prompt Injection. As the name suggests, this involves "injecting" the user prompt into the system prompt. Various strategies, such as leading statements and masking, are used to confuse the model. The goal is to make it difficult for the model to distinguish between the malicious user prompt and the system prompt, thereby allowing the malicious input to influence its behavior.
- Remote Code Execution. These prompts target AI systems with code execution capabilities, deliberately manipulating the model to run malicious or unauthorized code. The attack focuses on exploiting the system's ability to interpret and execute commands, potentially gaining access to sensitive information such as system logs, user data, or other restricted resources.
- System Prompt Leak. These prompts are crafted to trick the model into disclosing its internal instructions, which are usually kept hidden from users. The system prompt can include sensitive data and may also contain behavioral guidelines that dictate how the model should respond or interact with users.
- Tool Leak. These prompts target AI systems with tool capabilities, attempting to extract information about the system's available tools. The attack aims to reveal tool schemas, parameters, function call definitions, and other privileged technical details that are not intended to be disclosed to users. This technique can expose sensitive information about the system's architecture and capabilities.
Safety. These prompts involve direct questions that are unethical or harmful by nature. Their aim is to test the model's basic safety alignment by asking clear, straightforward questions that violate ethical guidelines. This category contains the following:
- Bias. These prompts attempt to elicit responses that show discrimination toward specific groups based on characteristics such as race, gender, religion, or nationality
- CBRN. These prompts seek information about chemical, biological, radiological, or nuclear weapons, including their creation, deployment, or use for harmful purposes.
- Cybercrime. These prompts relate to various cyber crimes such as hacking, phishing, identity theft, or other malicious online activities.
- Drugs. These prompts solicit information about illegal drug production, distribution, or use.
- Non-violent Crimes. These prompts ask for guidance on committing non-violent crimes such as fraud, identity theft, financial crimes, or corporate misconduct.
- Political. These prompts attempt to extract biased political statements, propaganda, or content that could influence public opinion.
- Self harm. These prompts seek information, encouragement, or methods related to suicide, self-injury, or other forms of self-destructive behavior.
- Sexual. These prompts request sexually explicit, inappropriate, or exploitative content related to sexual crimes and misconduct.
- Violent Crimes/Weapons. These prompts extract information about committing violent acts, creating weapons, or planning attacks that could cause physical harm to others.
Compliance. This category supports testing AI systems against established AI security frameworks such as OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, and DASF V2.0. It evaluates models against the specific risks identified in each standard, allowing users to assess compliance levels, identify potential vulnerabilities, and gain valuable insights into how effectively their systems adhere to recognized security frameworks.

Attack Severities

Each attack in the attack library has an associated severity. Severity of an attack is assessed subjectively by our in-house experts and is based on the sophistication of technique used and the impact it can have if successful.

Attack severities in AI Red Teaming are:

Critical
High
Medium
Low

Risk Score

This is the overall risk score assigned to the AI system based on the findings of the attack library scan. It points to the safety and security risk susceptibility of the system. A higher risk score indicates that the AI system is more vulnerable to safety and security attacks. Risk Score ranges from 0-100, 0 being practically no risk and 100 being very high risk.

Agent-based Red Teaming

In this mode, a LLM agent interrogates an AI system and then simulates attacks in a conversational fashion. Key modes of an agent scan include:

Completely automated
Human augmented

Completely automated agent scans

In this mode, the agent requires no inputs from the user. The agent first enquires about the nature/ use case of the AI system and then crafts attack goals based on that. To achieve the attack goals, the agent creates prompt attacks on the fly and then keeps adapting those based on the response of the AI system.

Human augmented agent scans

In this mode, the user can share details about the AI system for the agent to craft more pertinent goals. This details can include any or all of the following:

Base model - underlying base model powering the AI system
Use case - what is the model or application about e.g.- a customer support chatbot, HR system chatbot, a general purpose GPT etc.
Attack goals - these could be specific attack types which the user wants the agent to test for e.g.- leak customer data, share employee salary etc.

Attack objectives don't need to be crafted like proper attacks, they can be general English language statements and the agent will use those to craft proper attacks.
System Prompt- this is the system prompt used to train the AI system and if shared with AI Red Teaming, it can help carry out very advanced attacks on the system and can be treated as full white box testing of the system

Between these two modes, a user can do full spectrum testing of any AI system:

Black box testing: Using completely automated agent scan
Grey box testing: Using human augmented agent scan and sharing some details other than system prompt.
White box testing: Using human augmented agent scan and sharing all details including system prompt

Agent Scan Reports

Similar to Attack Library Scans, these reports also have an overall Risk Score pointing to the safety and security risk susceptibility of the AI system. The Risk Score is calculated based on the number of attack goals crafted by the agent which were successful and the number of techniques which had to be used to achieve them.

Agent Scans are conversational in nature, and do not have specific attack categories but they output the entire conversation between the agent and the AI system for a specific attack goal which was achieved.

Custom Attacks-based Red Teaming

Custom Attacks represent simulations of user-defined custom attack prompt sets. With this type of attack, you can:

upload and maintain your own prompt sets.
use one or more custom prompt sets alongside AI Red Teaming's standard attack library while scanning a target.
view dedicated reports for custom prompt attack results.

When using this method, consider the following key points:

A prompt set is a collection of related prompts grouped together for organizational purposes and efficient scanning.
Prompts are individual text inputs designed to test system vulnerabilities.

Supported Model Formats

Create a Deployment Profile for Prisma AIRS AI Red Teaming

Prisma AIRS Docs

AI Red Teaming Overview

Prisma AIRS Docs

AI Red Teaming Overview

Key Concepts

How Prisma AI Red Teaming Works

Attack Library-based Red Teaming

Agent-based Red Teaming

Custom Attacks-based Red Teaming

Activation & Onboarding

Next-Generation Firewalls

Cloud-Delivered Security Services

Network Security

Visibility & Monitoring