Prisma AIRS
AI Red Teaming Overview
Table of Contents
                    
          Expand All
          |
          Collapse All
        
        Prisma AIRS Docs
AI Red Teaming Overview
Learn about Prisma AIRS AI Red Teaming.
    
  | Where Can I Use This? | What Do I Need? | 
|---|---|
| 
 | 
 | 
Prisma AIRS supports automated AI Red Teaming. It scans any AI system (LLM or application
            powered by LLM) for safety and security vulnerabilities. There are a few key components
            of a red teaming exercise using this tool:
- Target - endpoint of the AI system scanned for vulnerabilities is called a target.
- Scan - one complete assessment of an AI system using AI Red Teaming is considered as a scan. A scan is carried out by sending attack payloads to an AI system in the form of attack prompts. Three modes of scanning an AI system are provided: Attack Library, Agent and Custom attacks.
- Report - the findings of any given AI Red Teaming scan are presented in the form of a Scan Report.
Key Concepts
There are a few concepts to consider when using AI Red Teaming; targets and
                    scans.
Target
A Target is the system or endpoint you want to perform red teaming on using AI
                Red Teaming. It serves as the focal point for testing and evaluating the security
                and resilience of your application or model. A target in AI Red Teaming can be:
    
    
- Models. First party or third party models with a defined endpoint for simulation.
- Applications. AI powered systems designed for specific tasks or objectives.
- Agents. Specific application subtype where AI models are in charge of the control flow.
AI Red Teaming is designed to work seamlessly with
                    REST APIs and Streaming APIs.
This flexibility allows you to test a wide range of targets, ensuring comprehensive
                red teaming capabilities for your LLM and LLM-powered applications.
Scan
A Scan represents a complete assessment of an AI system. During a scan, AI Red
                Teaming evaluates the system's security and robustness by sending carefully crafted
                attack payloads, also known as attack prompts, to the Target.
AI Red Teaming provides three distinct modes for scanning AI systems:
- Attack library scan. This scan uses a curated and regularly updated list of predefined attack scenarios. These attacks are designed based on known vulnerabilities and best practices in red teaming.
- Agent scan. This scan utilizes dynamic attack generation powered by an LLM attacker. This mode allows for real-time generation of attack payloads, making it highly adaptive to the specific behavior and responses of the Target.
- Custom attack scan. This scan allows you to upload and run your own prompt sets against target LLM endpoints alongside AI Red Teaming's built-in attack library.
By combining these modes, AI Red Teaming ensures
                a thorough and effective assessment of your AI system's defenses.
How Prisma AI Red Teaming Works
Prisma AI Red Teaming interacts with your applications and models, referred to as
                        Targets, in much the same way as an end user would. This interaction
                    enables AI Red Teaming to simulate realistic scenarios and identify
                    vulnerabilities or weaknesses in your application's or model's behavior. By
                    mimicking end-user actions, it ensures that its findings are relevant and
                    applicable to real-world use cases.
Single-Tenant Deployment for Security and Privacy
One of the core features of AI Red Teaming is its single-tenant deployment
                        model. This approach ensures that:
- Complete Isolation: All compute resources and data associated with a customer are completely isolated from other customers.
- Enhanced Security: No customer data ever mixes with another, eliminating the risk of cross-tenant data leaks or unauthorized access.
- Private Environment: Each customer operates in their own secure environment, providing peace of mind for sensitive applications or models that handle proprietary or confidential information.
Attack Library-based Red Teaming
In this mode, AI Red Teaming uses a proprietary attack library which is constantly
                updated to simulate attacks on any AI system. Key aspects of an attack library scan
                    include:
- Attack categories
- Attack severities
- Risk Score
Attack Categories
Each Attack Category contains a range of techniques. A prompt can incorporate
                techniques from multiple categories to enhance its Attack Success Rate (ASR) and is
                classified into categories based on the techniques it uses. The attack library
                currently has 3 categories of attacks which also undergo regular updates; Security,
                Safety, and Compliance:
- Security. Represents security vulnerabilities and potential exploits.
                        This category includes the following:- Adversarial Suffix Prompts. These prompts are created by appending an adversarial suffix—a series of tokens added to the end of a user query that leads to unintended behavior from the model. The adversarial suffixes are learned during the model's training phase, which, when used with a harmful user query, cause the model to deviate from its intended purpose and generate harmful and undesirable outputs.
- Evasion. These prompts use various obfuscation techniques, such as Base64, Leetspeak, ROT13, and other types of ciphers, allowing harmful content to be disguised in a way that prevents the model from identifying and flagging it as harmful or inappropriate. This manipulation enables prompts to bypass the model's safety filters, thereby allowing the extraction of harmful or malicious data.
- Indirect Prompt Injection. These prompts operate by embedding malicious instructions within external data sources such as web pages or documents that are subsequently referenced in text prompts. When the model processes the external data source, its behavior becomes compromised by the concealed instructions, leading to unintended responses and potentially harmful content generation. This technique is particularly effective as safety guardrails typically don't scan external content for malicious instructions.
- Jailbreak. These prompts are designed to make the model disobey its instructions and generate harmful content. They utilize a range of techniques, including role-playing or constructing imaginary environments where it might appear permissible for the model to produce such content, thereby leading to potential violations of ethical guidelines.
- Multi-turn. These prompts exploit the conversational nature of AI systems by gradually manipulating the model across multiple interactions. Rather than attempting to bypass safety measures in a single prompt, attack prompts build context over several exchanges, subtly steering the conversation toward harmful outputs. This technique leverages the model's ability to store context of past messages, allowing seemingly benign individual messages to collectively produce unintended behavior.
- Prompt Injection. As the name suggests, this involves "injecting" the user prompt into the system prompt. Various strategies, such as leading statements and masking, are used to confuse the model. The goal is to make it difficult for the model to distinguish between the malicious user prompt and the system prompt, thereby allowing the malicious input to influence its behavior.
- Remote Code Execution. These prompts target AI systems with code execution capabilities, deliberately manipulating the model to run malicious or unauthorized code. The attack focuses on exploiting the system's ability to interpret and execute commands, potentially gaining access to sensitive information such as system logs, user data, or other restricted resources.
- System Prompt Leak. These prompts are crafted to trick the model into disclosing its internal instructions, which are usually kept hidden from users. The system prompt can include sensitive data and may also contain behavioral guidelines that dictate how the model should respond or interact with users.
- Tool Leak. These prompts target AI systems with tool capabilities, attempting to extract information about the system's available tools. The attack aims to reveal tool schemas, parameters, function call definitions, and other privileged technical details that are not intended to be disclosed to users. This technique can expose sensitive information about the system's architecture and capabilities.
 
- Safety. These prompts involve direct questions that are unethical or
                        harmful by nature. Their aim is to test the model's basic safety alignment
                        by asking clear, straightforward questions that violate ethical guidelines.
                        This category contains the following:- Bias. These prompts attempt to elicit responses that show discrimination toward specific groups based on characteristics such as race, gender, religion, or nationality
- CBRN. These prompts seek information about chemical, biological, radiological, or nuclear weapons, including their creation, deployment, or use for harmful purposes.
- Cybercrime. These prompts relate to various cyber crimes such as hacking, phishing, identity theft, or other malicious online activities.
- Drugs. These prompts solicit information about illegal drug production, distribution, or use.
- Non-violent Crimes. These prompts ask for guidance on committing non-violent crimes such as fraud, identity theft, financial crimes, or corporate misconduct.
- Political. These prompts attempt to extract biased political statements, propaganda, or content that could influence public opinion.
- Self harm. These prompts seek information, encouragement, or methods related to suicide, self-injury, or other forms of self-destructive behavior.
- Sexual. These prompts request sexually explicit, inappropriate, or exploitative content related to sexual crimes and misconduct.
- Violent Crimes/Weapons. These prompts extract information about committing violent acts, creating weapons, or planning attacks that could cause physical harm to others.
 
- Compliance. This category supports testing AI systems against established AI security frameworks such as OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, and DASF V2.0. It evaluates models against the specific risks identified in each standard, allowing users to assess compliance levels, identify potential vulnerabilities, and gain valuable insights into how effectively their systems adhere to recognized security frameworks.
Attack Severities
Each attack in the attack library has an associated severity. Severity of an
                attack is assessed subjectively by our in-house experts and is based on the
                sophistication of technique used and the impact it can have if successful.
Attack severities in AI Red Teaming are:
- Critical
- High
- Medium
- Low
Risk Score
This is the overall risk score assigned to the AI system based on the findings of the
                attack library scan. It points to the safety and security risk susceptibility of the
                system. A higher risk score indicates that the AI system is more vulnerable to
                safety and security attacks. Risk Score ranges from 0-100, 0 being practically no
                risk and 100 being very high risk.
Agent-based Red Teaming
In this mode, a LLM agent interrogates an AI system and then simulates attacks in a
                conversational fashion. Key modes of an agent scan include:
- Completely automated
- Human augmented
Completely automated agent scans
In this mode, the agent requires no inputs from the user. The agent first enquires
                about the nature/ use case of the AI system and then crafts attack goals based on
                that. To achieve the attack goals, the agent creates prompt attacks on the fly and
                then keeps adapting those based on the response of the AI system.
Human augmented agent scans
In this mode, the user can share details about the AI system for the agent to craft
                more pertinent goals. This details can include any or all of the following:
- Base model - underlying base model powering the AI system
- Use case - what is the model or application about e.g.- a customer support chatbot, HR system chatbot, a general purpose GPT etc.
- Attack goals - these could be specific attack types which the user wants the agent to test for e.g.- leak customer data, share employee salary etc.Attack objectives don't need to be crafted like proper attacks, they can be general English language statements and the agent will use those to craft proper attacks.
- System Prompt- this is the system prompt used to train the AI system and if shared with AI Red Teaming, it can help carry out very advanced attacks on the system and can be treated as full white box testing of the system
Between these two modes, a user can do full spectrum testing of any AI system:
- Black box testing: Using completely automated agent scan
- Grey box testing: Using human augmented agent scan and sharing some details other than system prompt.
- White box testing: Using human augmented agent scan and sharing all details including system prompt
Agent Scan Reports
Similar to Attack Library Scans, these reports also have an overall Risk Score
                pointing to the safety and security risk susceptibility of the AI system. The Risk
                Score is calculated based on the number of attack goals crafted by the agent which
                were successful and the number of techniques which had to be used to achieve
                them.
Agent Scans are conversational in nature, and do not have specific attack categories
                but they output the entire conversation between the agent and the AI system for a
                specific attack goal which was achieved.
Custom Attacks-based Red Teaming
Custom Attacks represent simulations of user-defined custom attack prompt sets. With
                this type of attack, you can:
- upload and maintain your own prompt sets.
- use one or more custom prompt sets alongside AI Red Teaming's standard attack library while scanning a target.
- view dedicated reports for custom prompt attack results.
When using this method, consider the following key points:
- A prompt set is a collection of related prompts grouped together for organizational purposes and efficient scanning.
- Prompts are individual text inputs designed to test system vulnerabilities.
