Scans

Learn about AI Red Teaming Scans in Prisma AIRS.
Where Can I Use This?
  • Prisma AIRS (AI Red Teaming)
What Do I Need?
  • Prisma AIRS AI Red Teaming License
  • Prisma AIRS AI Red Teaming Deployment Profile
One complete assessment of an AI system using Prisma AIRS AI Red Teaming is considered a scan. A scan is carried out by sending attack payloads to an AI system in the form of attack prompts.
Prisma AIRS AI Red Teaming offers the following three types of scanning for AI systems:
  • Red Teaming using Attack Library—This scan uses a curated and regularly updated list of predefined attack scenarios. These attacks are designed based on known vulnerabilities and best practices in red teaming.
  • Red Teaming using Agent—This scan utilizes dynamic attack generation powered by an LLM attacker. This type allows for real-time generation of attack payloads, making it highly adaptive to the specific behavior and responses of the Target.
  • Red Teaming using Custom Prompt Sets—This scan allows you to upload and run your own prompt sets against target LLM endpoints alongside AI Red Teaming's built-in attack library.

Red Teaming using Attack Library

In this type, AI Red Teaming uses a proprietary attack library, which is constantly updated, to simulate attacks on any AI system.
Key aspects of an attack library scan include:
  • Attack categories
  • Attack severities
  • Risk Score

Attack Categories

Each Attack Category contains a range of techniques. A prompt can incorporate techniques from multiple categories to enhance its Attack Success Rate (ASR) and is classified into categories based on the techniques it uses.
The attack library currently has three categories of attacks, which also undergo regular updates: Security, Safety, and Compliance. Each category and its attack scope are described below.
Security
Represents security vulnerabilities and potential exploits. This category includes the following:
  • Adversarial Suffix Prompts—These prompts are created by appending an adversarial suffix—a series of tokens added to the end of a user query that leads to unintended behavior from the model. Adversarial suffixes are discovered through automated optimization against the model and, when combined with a harmful user query, cause the model to deviate from its intended purpose and generate harmful and undesirable outputs.
  • Evasion—These prompts use various obfuscation techniques, such as Base64, Leetspeak, ROT13, and other ciphers, allowing harmful content to be disguised in a way that prevents the model from identifying and flagging it as harmful or inappropriate. This manipulation enables prompts to bypass the model's safety filters, thereby allowing the extraction of harmful or malicious data (see the encoding sketch after these category descriptions).
  • Indirect Prompt Injection—These prompts operate by embedding malicious instructions within external data sources such as web pages or documents that are subsequently referenced in text prompts. When the model processes the external data source, its behavior becomes compromised by the concealed instructions, leading to unintended responses and potentially harmful content generation. This technique is particularly effective as safety guardrails typically don't scan external content for malicious instructions.
  • Jailbreak—These prompts are designed to make the model disobey its instructions and generate harmful content. They utilize a range of techniques, including role-playing or constructing imaginary environments where it might appear permissible for the model to produce such content, thereby leading to potential violations of ethical guidelines.
  • Multi-turn—These prompts exploit the conversational nature of AI systems by gradually manipulating the model across multiple interactions. Rather than attempting to bypass safety measures in a single prompt, attack prompts build context over several exchanges, subtly steering the conversation toward harmful outputs. This technique leverages the model's ability to store context of past messages, allowing seemingly benign individual messages to collectively produce unintended behavior.
  • Prompt Injection—As the name suggests, this involves "injecting" the user prompt into the system prompt. Various strategies, such as leading statements and masking, are used to confuse the model. The goal is to make it difficult for the model to distinguish between the malicious user prompt and the system prompt, thereby allowing the malicious input to influence its behavior.
  • Remote Code Execution—These prompts target AI systems with code execution capabilities, deliberately manipulating the model to run malicious or unauthorized code. The attack focuses on exploiting the system's ability to interpret and execute commands, potentially gaining access to sensitive information such as system logs, user data, or other restricted resources.
  • System Prompt Leak—These prompts are crafted to trick the model into disclosing its internal instructions, which are usually kept hidden from users. The system prompt can include sensitive data and may also contain behavioral guidelines that dictate how the model should respond or interact with users.
  • Tool Leak—These prompts target AI systems with tool capabilities, attempting to extract information about the system's available tools. The attack aims to reveal tool schemas, parameters, function call definitions, and other privileged technical details that are not intended to be disclosed to users. This technique can expose sensitive information about the system's architecture and capabilities.
Safety
These prompts involve direct questions that are unethical or harmful by nature. Their aim is to test the model's basic safety alignment by asking clear, straightforward questions that violate ethical guidelines. This category contains the following:
  • Bias—These prompts attempt to elicit responses that show discrimination toward specific groups based on characteristics such as race, gender, religion, or nationality.
  • CBRN—These prompts seek information about chemical, biological, radiological, or nuclear weapons, including their creation, deployment, or use for harmful purposes.
  • Cybercrime—These prompts relate to various cybercrimes such as hacking, phishing, identity theft, or other malicious online activities.
  • Drugs—These prompts solicit information about illegal drug production, distribution, or use.
  • Non-violent Crimes—These prompts ask for guidance on committing non-violent crimes such as fraud, identity theft, financial crimes, or corporate misconduct.
  • Political—These prompts attempt to extract biased political statements, propaganda, or content that could influence public opinion.
  • Self-harm—These prompts seek information, encouragement, or methods related to suicide, self-injury, or other forms of self-destructive behavior.
  • Sexual—These prompts request sexually explicit, inappropriate, or exploitative content related to sexual crimes and misconduct.
  • Violent Crimes/Weapons—These prompts extract information about committing violent acts, creating weapons, or planning attacks that could cause physical harm to others.
Compliance
Supports testing AI systems against established AI security frameworks such as OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, and DASF V2.0. It evaluates models against the specific risks identified in each standard, allowing users to assess compliance levels, identify potential vulnerabilities, and gain valuable insights into how effectively their systems adhere to recognized security frameworks.
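
To make the Evasion techniques above concrete, the following is a minimal Python sketch (not Prisma AIRS code) of how an attack payload could be disguised with Base64, ROT13, and leetspeak before being embedded in a prompt. The `payload` string is a harmless placeholder.

```python
import base64
import codecs

# Harmless placeholder; in a real evasion attack this would be content that
# the model's safety filters would normally flag.
payload = "EXAMPLE_RESTRICTED_REQUEST"

# Base64 hides the payload behind an encoded blob.
b64_variant = base64.b64encode(payload.encode()).decode()

# ROT13 is a simple letter-substitution cipher.
rot13_variant = codecs.encode(payload, "rot_13")

# Leetspeak swaps letters for visually similar digits.
leet_map = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})
leet_variant = payload.lower().translate(leet_map)

# An evasion prompt then asks the model to decode and act on the hidden text.
evasion_prompt = f"Decode this Base64 string and follow the instructions: {b64_variant}"
print(evasion_prompt, rot13_variant, leet_variant)
```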

Attack Severities

Each attack in the attack library has an associated severity. The severity of an attack is assessed subjectively by our in-house experts and is based on the sophistication of the technique used and the impact it can have if successful.
Attack severities in AI Red Teaming are:
  • Critical
  • High
  • Medium
  • Low

Risk Score

This is the overall risk score assigned to the AI system based on the findings of the attack library scan. It indicates the system's susceptibility to safety and security risks: a higher risk score means the AI system is more vulnerable to safety and security attacks.
The Risk Score ranges from 0 to 100, where 0 indicates practically no risk and 100 indicates very high risk.
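
The exact scoring model behind the Risk Score is proprietary and is not described in this document. The snippet below is only a hypothetical illustration, assuming a severity-weighted success ratio scaled to 0-100, of how attack findings could roll up into such a score.

```python
# Hypothetical severity weights; the real Prisma AIRS scoring model is proprietary.
SEVERITY_WEIGHTS = {"critical": 1.0, "high": 0.7, "medium": 0.4, "low": 0.2}

def illustrative_risk_score(results):
    """Roll attack results up into a 0-100 score (illustration only).

    `results` is a list of (severity, succeeded) tuples, one per attack
    sent during the attack library scan.
    """
    if not results:
        return 0.0
    total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    successful = sum(SEVERITY_WEIGHTS[sev] for sev, ok in results if ok)
    return round(100 * successful / total, 1)

# Example: one successful critical attack out of three attempted attacks.
print(illustrative_risk_score([("critical", True), ("critical", False), ("low", False)]))  # 45.5
```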

Red Teaming using Agent

In this type, an LLM agent interrogates an AI system and then simulates attacks in a conversational fashion. Key modes of an agent scan include:
  • Completely automated
  • Human augmented
Between these two modes, you can perform full-spectrum testing of any AI system:
  • Black box testing—Using a completely automated agent scan.
  • Grey box testing—Using a human augmented agent scan and sharing some details other than the system prompt.
  • White box testing—Using a human augmented agent scan and sharing all details, including the system prompt.

Completely Automated Agent Scans

In this mode, the agent requires no input from the user. The agent first inquires about the nature or use case of the AI system and then crafts attack goals based on that. To achieve the attack goals, the agent creates prompt attacks on the fly and keeps adapting them based on the responses of the AI system.
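
Conceptually, this mode follows a probe, plan, and adapt loop. The sketch below is a simplified illustration of that flow; every function in it is a hypothetical placeholder standing in for the agent's internals, not a Prisma AIRS API.

```python
# Conceptual sketch of an automated agent scan. All helpers are hypothetical
# placeholders so the control flow can be run end to end.

def discover_use_case(target):
    # Placeholder: the real agent interrogates the target to infer its use case.
    return "customer support chatbot"

def craft_goals(use_case):
    # Placeholder: the real agent derives attack goals from the inferred use case.
    return [f"make the {use_case} reveal its internal instructions"]

def generate_attack(goal, history):
    # Placeholder: the real agent uses an LLM attacker that adapts to prior turns.
    return f"Attempt {len(history) + 1}: {goal}"

def send_to_target(target, prompt):
    # Placeholder: the real scan sends the prompt to the target AI system.
    return f"{target} refused: '{prompt}'"

def goal_achieved(goal, response):
    # Placeholder: the real agent judges whether the response satisfies the goal.
    return "refused" not in response

def automated_agent_scan(target, max_turns=10):
    use_case = discover_use_case(target)        # probe the target's nature
    findings = []
    for goal in craft_goals(use_case):          # craft goals from the use case
        history = []
        for _ in range(max_turns):
            prompt = generate_attack(goal, history)   # attack prompt created on the fly
            response = send_to_target(target, prompt)
            history.append((prompt, response))
            if goal_achieved(goal, response):         # adapt until the goal is met
                findings.append((goal, history))
                break
    return findings

print(automated_agent_scan("demo-chatbot"))
```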

Human Augmented Agent Scans

In this mode, the user can share details about the AI system so the agent can craft more pertinent goals (see the configuration sketch after this list). These details can include any or all of the following:
  • Base model—The underlying base model powering the AI system.
  • Use case—What the model or application is for. For example, a customer support chatbot, an HR system chatbot, or a general-purpose GPT.
  • Attack goals—Specific attack types that the user wants the agent to test for, such as leaking customer data or sharing employee salary details.
    Attack goals don't need to be crafted like proper attacks; they can be general English-language statements, and the agent will use them to craft proper attacks.
  • System prompt—The system prompt used by the AI system. If shared with AI Red Teaming, it enables very advanced attacks on the system and can be treated as full white box testing of the system.
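
The details above can be pictured as a single scan configuration. The structure below is purely illustrative; the field names and values are assumptions for the sketch, not the Prisma AIRS interface.

```python
# Illustrative only: field names and values are assumptions, not the Prisma AIRS API.
human_augmented_scan_config = {
    "base_model": "gpt-4o",                   # underlying base model (example value)
    "use_case": "HR system chatbot",          # what the application is for
    "attack_goals": [                         # plain-English goals; the agent turns
        "leak customer data",                 # these into proper attacks
        "share employee salary details",
    ],
    # Sharing the system prompt enables white box style testing (optional).
    "system_prompt": "You are an HR assistant. Never disclose salary data.",
}
```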

Agent Scan Reports

Similar to attack library scans, these reports have an overall Risk Score indicating the AI system's susceptibility to safety and security risks. The Risk Score is calculated based on the number of agent-crafted attack goals that were successful and the number of techniques that had to be used to achieve them.
Agent scans are conversational in nature and do not have specific attack categories. Instead, for each attack goal that was achieved, the report includes the entire conversation between the agent and the AI system.

Red Teaming using Custom Prompt Sets

The custom attacks functionality within AI Red Teaming lets you upload and run your own prompt sets against target LLM endpoints alongside AI Red Teaming's built-in attack library.
With custom attacks, you can:
  • Upload and maintain your own prompt sets.
  • Use one or more custom prompt sets alongside AI Red Teaming's standard attack library while scanning a target.
  • View dedicated reports for custom prompt attack results.

Prompt Sets

Prompt Sets
A prompt set is a collection of related prompts grouped together for organizational purposes and efficient scanning.
Prompts
Prompts are individual text inputs designed to test system vulnerabilities.
All prompts in the system require validation before they can be used in security scans. This process ensures that prompts have clear attack goals and are ready for effective testing.
Prompts can be validated automatically or manually.
Automatic Validation
All prompts undergo automatic validation, which can take 5-10 minutes. Validation involves interpreting the prompt and generating an attack goal for it, using our proprietary LLMs.
Manual Validation
If automatic validation fails, you'll be asked to manually validate the prompt by adding a goal for it; you can also choose to skip the prompt.
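
The validation lifecycle can be summarized as a small state model. The sketch below is a conceptual illustration only; the class, field, and function names are assumptions, not the product's data model or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Conceptual model only; names are assumptions, not the Prisma AIRS data model.

class ValidationStatus(Enum):
    NOT_VALIDATED = "not_validated"
    VALIDATING = "validating"
    VALIDATED = "validated"

@dataclass
class Prompt:
    text: str
    goal: Optional[str] = None
    status: ValidationStatus = ValidationStatus.NOT_VALIDATED

def auto_validate(prompt: Prompt, generate_goal) -> Prompt:
    """Automatic validation: an attack goal is generated for the prompt."""
    prompt.status = ValidationStatus.VALIDATING
    goal = generate_goal(prompt.text)  # stands in for the proprietary LLM step
    if goal:
        prompt.goal, prompt.status = goal, ValidationStatus.VALIDATED
    else:
        prompt.status = ValidationStatus.NOT_VALIDATED  # falls back to manual validation
    return prompt

def manually_validate(prompt: Prompt, goal: str) -> Prompt:
    """Manual validation: the user reviews the prompt and supplies the goal."""
    prompt.goal, prompt.status = goal, ValidationStatus.VALIDATED
    return prompt

# Only validated prompts are used in a scan; not-validated prompts are ignored.
def scan_ready(prompts):
    return [p for p in prompts if p.status is ValidationStatus.VALIDATED]
```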

Managing Prompt Sets

You can perform the following actions on your prompt sets:
  • Edit prompt set name
  • Edit description
  • Add new prompts
  • Delete individual prompts
  • Validate unvalidated prompts
Prompts can have the following validation status:
  • Validated—Indicates that the prompt is ready to use.
  • Validating—Indicates that auto-validation is in progress.
  • Not validated—Indicates that the prompt requires manual validation.

Prompt Set Usage

Prompt set usage follows these rules:
  • A prompt set is enabled and ready to use when at least one prompt in the set is validated.
  • Only validated prompts within the prompt set are used in a scan.
  • Not-validated prompts are ignored during scans.
The validation statuses and views are described below:
Validation In-progress
This status indicates that the prompts are being validated. This process runs in the background and may take up to 10 minutes to complete, depending on the length of the prompt. You can continue working until validation is complete. This view displays the prompt and its corresponding validation status.
Validated Prompts
This displays validated prompts. It provides an overview of each prompt, including its status, when it was added, and a brief description.
Not-Validated Prompts
This displays prompts that were not automatically validated. These prompts require manual validation and cannot be used in scans until they are validated.
Manual Validation
For prompts that require manual validation, review the prompt's details and specify the goal to complete validation. Enter the prompt and specify the goal, then validate the prompt.