API Rate Limiting for the AI Runtime Security Scan API controls the volume and
frequency of API requests made by individual tenants. This mechanism
enforces per-tenant limits on both the number of requests and the volume of
tokens processed, ensuring equitable resource distribution and service
stability in your environment. Without rate limiting, a single tenant could
consume excessive API capacity, degrading service quality for other tenants
sharing the same infrastructure. This feature mitigates that risk by
enforcing limits derived from your tenant's subscription, preventing "noisy
neighbor" issues and ensuring fair resource allocation.
Each tenant has an allocated cap on the requests per second (RPS) and
tokens per minute (TPM) that it can consume through the AI Runtime Security
API. By default, each tenant is limited to 150 RPS and 15 million tokens
per minute. Contact your Palo Alto Networks account team to request
additional allocated throughput.
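Because both limits apply at once, it can help to work out which one a steady workload would reach first. The following is a minimal sketch of that arithmetic using the default values quoted above. The function names and the idea of an "average tokens per request" input are illustrative assumptions, not part of the product API.

```python
# Default per-tenant limits quoted in the text above.
DEFAULT_RPS = 150
DEFAULT_TPM = 15_000_000


def max_sustained_rps(avg_tokens_per_request: float,
                      rps_limit: int = DEFAULT_RPS,
                      tpm_limit: int = DEFAULT_TPM) -> float:
    """Return the highest steady request rate that stays under both limits."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # Request rate at which the per-minute token budget alone would run out.
    rps_allowed_by_tokens = tpm_limit / 60 / avg_tokens_per_request
    return min(float(rps_limit), rps_allowed_by_tokens)


def binding_limit(avg_tokens_per_request: float,
                  rps_limit: int = DEFAULT_RPS,
                  tpm_limit: int = DEFAULT_TPM) -> str:
    """Name the limit ("RPS" or "TPM") that a steady workload reaches first."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    rps_allowed_by_tokens = tpm_limit / 60 / avg_tokens_per_request
    return "TPM" if rps_allowed_by_tokens < rps_limit else "RPS"
```

With the defaults, 15 million tokens per minute is 250,000 tokens per second, so requests that average more than roughly 1,667 tokens reach the token limit before the 150 RPS limit. This sketch covers steady traffic only and does not model burst throttling.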
To ensure service stability, requests
that arrive in short bursts may be throttled even if the overall rate limit
has not been reached. Palo Alto Networks recommends distributing requests
evenly over time for best results.
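One way to spread requests evenly on the client side is to space them at a fixed interval. The following is a minimal sketch of that idea, not an official client. The RequestPacer class and its parameters are hypothetical names introduced for illustration, and the service's actual burst-throttling behavior is not specified here.

```python
import time
from typing import Callable


class RequestPacer:
    """Spaces calls at a fixed interval so traffic is spread evenly."""

    def __init__(self, target_rps: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if target_rps <= 0:
            raise ValueError("target_rps must be positive")
        self._interval = 1.0 / target_rps
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()

    def wait(self) -> float:
        """Block until the next send slot and return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_slot - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule from the later of "now" and the planned slot, so idle
        # time is never banked and released later as a burst.
        self._next_slot = max(now, self._next_slot) + self._interval
        return delay
```

A caller would create one pacer per tenant credential, for example RequestPacer(target_rps=100) to stay below the default 150 RPS, and call wait() before each Scan API request. The sketch is single-threaded. Sharing a pacer across threads would need a lock around wait().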
How It Works
The AI Runtime Security API Rate Limiting feature enforces per-tenant
limits on Scan API usage to keep the service stable and distribute resources
fairly. Without enforcement, a single tenant can degrade service for
others. The feature addresses this by:
Automated Scaling: Automatically calculating request rate
(RPS) and volume (TPM) limits based on the tenant’s monthly
subscription, measured in billions of tokens.
Precision Control: Utilizing an Apigee gateway layer to
enforce limits at the Auth Code level, with built-in "floor" bounds
to ensure even small tenants maintain a minimum viable service
level.
Operational Flexibility: Providing operators with the
ability to manually override limits or tune specific parameters
(like peak factors and burst allowances) for customers with unique
traffic profiles.
Safe Rollout: Implementing a phased "Shadow Mode" approach
to observe real-world traffic patterns before moving to gradual,
cohort-based enforcement.
The auto-calculation formula is designed
to translate a tenant's subscription into concrete technical limits (RPS and
TPM) while accounting for traffic spikes and data density.