LLM capacity is a limited resource across the industry: all providers (Azure, OpenAI, AWS Bedrock, Google Cloud Vertex, and others) cap the maximum capacity available per account. Palantir AIP therefore operates within the constraints set by LLM providers. The standard units of measure across the industry are tokens per minute (TPM) and requests per minute (RPM).
Palantir has set a maximum capacity for each enrollment, referred to as “enrollment-level rate limits”. This capacity is measured per model using TPM and RPM, and covers all models from all providers enabled on your enrollment, including GPT, Claude, Gemini, Llama, Mixtral, and more. In this way, each model has a separate, independent capacity unaffected by the usage of other models.
By default, all customers are on the medium tier, which is large enough to build prototypes and scale to a few use cases, even with hundreds of users and large datasets, such as datasets containing millions of documents.
Additionally, AIP offers the option to upgrade the enrollment capacity from the medium tier to a large or XL tier if you require additional capacity. If you are consistently hitting enrollment rate limits that block you from expanding your AIP usage, or if you expect to increase the volume of your pipelines or the total number of users, contact Palantir Support.
Enrollment limits are now displayed on the AIP rate limits tab in the Resource Management application, along with the enrollment tier.
Enrollment tiers, particularly the XL tier, offer enough capacity to build large-scale workflows. These tiers have provided sufficient capacity for hundreds of Palantir customers using LLMs at scale, and we continue to increase these limits.
The table below contains enrollment limits for tokens per minute (TPM) and requests per minute (RPM) for each enrollment tier. For enrollments with both Azure OpenAI and Direct OpenAI enabled, enrollment limits for those models are double what is shown below. Additionally, for enrollments geo-restricted to a single region, TPM and RPM in the Large and XLarge tiers may be lower than the table indicates.
| Model Backends | Model Name | Per-User Limits | Small Tier | Medium Tier | Large Tier | XLarge Tier |
| --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock | Claude 3 Sonnet | 200K TPM / 300 RPM | 50K TPM / 100 RPM | 450K TPM / 500 RPM | 675K TPM / 750 RPM | 900K TPM / 1K RPM |
| Amazon Bedrock | Claude 3 Haiku | 270K TPM / 770 RPM | 60K TPM / 250 RPM | 600K TPM / 1K RPM | 1.5M TPM / 1.5K RPM | 2M TPM / 2K RPM |
| Amazon Bedrock | Claude 3.5 Sonnet | 230K TPM / 120 RPM | 50K TPM / 60 RPM | 1M TPM / 300 RPM | 1.5M TPM / 450 RPM | 2M TPM / 600 RPM |
| Amazon Bedrock, Google Vertex | Claude 3.5 Haiku | 500K TPM / 1K RPM | 100K TPM / 400 RPM | 1.2M TPM / 1.8K RPM | 1.8M TPM / 2.7K RPM | 2.4M TPM / 3.6K RPM |
| Amazon Bedrock, Google Vertex | Claude 3.5 Sonnet V2 | 200K TPM / 60 RPM | 30K TPM / 20 RPM | 300K TPM / 100 RPM | 500K TPM / 200 RPM | 600K TPM / 300 RPM |
| Amazon Bedrock, Google Vertex | Claude 3.7 Sonnet | 300K TPM / 100 RPM | 100K TPM / 40 RPM | 1M TPM / 400 RPM | 2M TPM / 600 RPM | 3M TPM / 1K RPM |
| Amazon Bedrock, Google Vertex | Claude 4 Sonnet | 300K TPM / 25 RPM | 50K TPM / 10 RPM | 1M TPM / 100 RPM | 2M TPM / 250 RPM | 3M TPM / 500 RPM |
| Palantir Hub | Code Llama 2 13b Instruct | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Llama 2 13b Chat | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Llama 2 70b Chat | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Llama 3 70b Instruct | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Llama 3 8b Instruct | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Llama 3.1 8b Instruct | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Llama 3.1 70b Instruct | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Llama 3.3 70b Instruct | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| xAI | Grok-2 | 200K TPM / 300 RPM | 60K TPM / 150 RPM | 2M TPM / 400 RPM | 3M TPM / 700 RPM | 4M TPM / 1K RPM |
| xAI | Grok-2 Vision | 200K TPM / 300 RPM | 60K TPM / 150 RPM | 2M TPM / 400 RPM | 3M TPM / 700 RPM | 4M TPM / 1K RPM |
| xAI | Grok 3 | 100K TPM / 100 RPM | 60K TPM / 10 RPM | 1M TPM / 50 RPM | 2M TPM / 250 RPM | 3M TPM / 500 RPM |
| xAI | Grok 3 Mini (with Thinking) | 50K TPM / 100 RPM | 60K TPM / 10 RPM | 600K TPM / 50 RPM | 1M TPM / 100 RPM | 1.2M TPM / 150 RPM |
| Palantir Hub | Schematic 7B | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Document Information Extraction | 1M TPM / 40 RPM | 1M TPM / 40 RPM | 1M TPM / 200 RPM | 2M TPM / 400 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Snowflake Arctic Embed Medium | 500K TPM / 500 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Palantir Hub | Mistral 7B Instruct | 50K TPM / 100 RPM | 60K TPM / 150 RPM | 300K TPM / 450 RPM | 450K TPM / 675 RPM | 600K TPM / 900 RPM |
| Azure OpenAI | GPT-4 Turbo | 50K TPM / 100 RPM | 60K TPM / 120 RPM | 375K TPM / 450 RPM | 562.5K TPM / 675 RPM | 750K TPM / 900 RPM |
| Azure OpenAI, Direct OpenAI | GPT-4o | 300K TPM / 800 RPM | 60K TPM / 150 RPM | 1M TPM / 1K RPM | 1.5M TPM / 2K RPM | 3M TPM / 4K RPM |
| Azure OpenAI, Direct OpenAI | GPT-4o mini | 300K TPM / 800 RPM | 60K TPM / 150 RPM | 1M TPM / 1K RPM | 1.5M TPM / 2K RPM | 3M TPM / 4K RPM |
| Azure OpenAI, Direct OpenAI | GPT-4.1 | 300K TPM / 1K RPM | 100K TPM / 100 RPM | 1M TPM / 1K RPM | 2M TPM / 2K RPM | 4M TPM / 4K RPM |
| Azure OpenAI, Direct OpenAI | GPT-4.1 mini | 300K TPM / 1K RPM | 100K TPM / 100 RPM | 1M TPM / 1K RPM | 2M TPM / 2K RPM | 4M TPM / 4K RPM |
| Azure OpenAI, Direct OpenAI | GPT-4.1 nano | 300K TPM / 1K RPM | 100K TPM / 100 RPM | 1M TPM / 1K RPM | 2M TPM / 2K RPM | 4M TPM / 4K RPM |
| Azure OpenAI | o1 | 600K TPM / 5 RPM | 100K TPM / 10 RPM | 250K TPM / 25 RPM | 400K TPM / 40 RPM | 750K TPM / 75 RPM |
| Azure OpenAI, Direct OpenAI | o1-mini | 600K TPM / 10 RPM | 100K TPM / 10 RPM | 250K TPM / 25 RPM | 400K TPM / 40 RPM | 750K TPM / 75 RPM |
| Azure OpenAI | o3-mini | 300K TPM / 1K RPM | 100K TPM / 100 RPM | 1M TPM / 1K RPM | 2M TPM / 2K RPM | 4M TPM / 4K RPM |
| Azure OpenAI, Direct OpenAI | o3 | 300K TPM / 100 RPM | 100K TPM / 10 RPM | 1M TPM / 1K RPM | 2M TPM / 2K RPM | 4M TPM / 4K RPM |
| Azure OpenAI, Direct OpenAI | o4-mini | 300K TPM / 100 RPM | 100K TPM / 100 RPM | 1M TPM / 1K RPM | 2M TPM / 2K RPM | 4M TPM / 4K RPM |
| Azure OpenAI, Direct OpenAI | text-embedding-ada-002 | 1M TPM / 1.5K RPM | 450K TPM / 3K RPM | 2.1M TPM / 4.5K RPM | 3.1M TPM / 6.8K RPM | 4.2M TPM / 9K RPM |
| Azure OpenAI, Direct OpenAI | Text Embedding 3 Small | 1M TPM / 1.5K RPM | 60K TPM / 400 RPM | 300K TPM / 2K RPM | 450K TPM / 3K RPM | 600K TPM / 6K RPM |
| Azure OpenAI, Direct OpenAI | Text Embedding 3 Large | 1M TPM / 1.5K RPM | 60K TPM / 400 RPM | 1M TPM / 2K RPM | 2M TPM / 3K RPM | 3M TPM / 6K RPM |
| Google Vertex | Gemini 1.5 Flash | 300K TPM / 200 RPM | 60K TPM / 150 RPM | 2M TPM / 600 RPM | 3M TPM / 1.2K RPM | 4M TPM / 2K RPM |
| Google Vertex | Gemini 1.5 Pro | 300K TPM / 200 RPM | 60K TPM / 150 RPM | 2M TPM / 400 RPM | 3M TPM / 700 RPM | 4M TPM / 1K RPM |
| Google Vertex | Gemini 2.0 Flash | 300K TPM / 200 RPM | 60K TPM / 150 RPM | 2M TPM / 600 RPM | 3M TPM / 1.2K RPM | 4M TPM / 2K RPM |
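To make these limits concrete, the following back-of-the-envelope sketch estimates the TPM and RPM a batch pipeline would need and checks the result against the medium-tier limit for Claude 3.7 Sonnet from the table above. The workload numbers (documents per minute, tokens per document) are hypothetical assumptions, not measurements.

```python
# Hypothetical sizing exercise: estimate required capacity for a document pipeline.

def required_capacity(docs_per_minute: int, avg_tokens_per_doc: int) -> tuple[int, int]:
    """Return (TPM, RPM) needed if each document is processed as one LLM request."""
    tpm = docs_per_minute * avg_tokens_per_doc
    rpm = docs_per_minute
    return tpm, rpm

# Assumption: 200 documents/minute at ~2,000 tokens each (prompt + completion).
tpm_needed, rpm_needed = required_capacity(docs_per_minute=200, avg_tokens_per_doc=2_000)
print(f"Need ~{tpm_needed:,} TPM and {rpm_needed} RPM")  # ~400,000 TPM, 200 RPM

# Medium-tier limit for Claude 3.7 Sonnet from the table: 1M TPM / 400 RPM.
MEDIUM_TIER_TPM, MEDIUM_TIER_RPM = 1_000_000, 400
print("Fits medium tier:", tpm_needed <= MEDIUM_TIER_TPM and rpm_needed <= MEDIUM_TIER_RPM)
```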
Enrollment administrators can navigate to the AIP usage & limits page in the Resource Management application to:

- View usage: View LLM token and request usage of all Palantir-provided models for all Projects and resources in your enrollment.
- Manage rate limits: Configure the maximum percentage of TPM and RPM that all resources within a given Project, combined, can use in any given minute, per model.
The View usage tab provides visibility into LLM token and request usage of all Palantir-provided models for all Projects and resources in your enrollment. Administrators can use this view to better manage LLM capacity and handle rate limits.
Note that this view is not optimized for LLM cost management. Learn how to review LLM costs on AIP-enabled enrollments via the Analysis tab.
If you are hitting rate limits at the enrollment or Project level, consider the options described on this page: for example, retrying throttled requests with backoff (sketched below), adjusting Project rate limits, requesting reserved capacity, or contacting Palantir Support about a higher enrollment tier.
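One common client-side mitigation is to retry throttled requests with exponential backoff and jitter, so bursts of requests spread out rather than repeatedly hitting the per-minute limits. The sketch below is illustrative only; `call_model` and `RateLimitError` are hypothetical placeholders for whatever client and error type your integration actually uses.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical placeholder for the throttling error your client raises."""

def call_model(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call that may be rate-limited."""
    raise RateLimitError

def call_with_backoff(prompt: str, max_retries: int = 5) -> str:
    # Exponential backoff with jitter: wait roughly 1s, 2s, 4s, ... between
    # attempts, plus a random fraction of a second to avoid synchronized retries.
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("unreachable")
```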
The Manage rate limits tab gives you the flexibility to maximize LLM utilization for ambitious production use cases in AIP while limiting or preventing experimental Projects from saturating the entire enrollment capacity. Enrollment administrators can configure the maximum percentage of TPM and RPM that all resources within a given Project, combined, can use in any given minute, per model.
By default, all Projects operate under a default limit. An administrator can create additional Project limits, defining which Projects each limit includes and what percentage of enrollment capacity they can use.
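As a quick illustration of how a percentage-based Project limit translates into absolute capacity (hypothetical configuration; the 25% share is an assumption):

```python
# Hypothetical: a Project capped at 25% of enrollment capacity for a model
# whose enrollment limit is 1M TPM / 1K RPM (e.g., GPT-4o on the medium tier).
enrollment_tpm, enrollment_rpm = 1_000_000, 1_000
project_share = 0.25  # percentage configured on the Manage rate limits tab

project_tpm = int(enrollment_tpm * project_share)  # 250,000 TPM
project_rpm = int(enrollment_rpm * project_share)  # 250 RPM
print(f"Project ceiling: {project_tpm:,} TPM / {project_rpm} RPM per minute")
```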
Reserved capacity is an AIP LLM capacity management tool in Resource Management. It secures tokens per minute (TPM) and requests per minute (RPM) for production workflows in addition to existing enrollment capacity. The goal is to protect critical production workflows from being constrained by Project rate limits, enrollment limits, and other resources competing over the same pool of TPM and RPM.
We cannot guarantee the availability of reserved capacity for all models at all times. This depends on the availability and offerings of model providers such as Azure, AWS, GCP, xAI, and others. We aim to offer reserved capacity on all industry-leading flagship models.
Based on AIP's performance over the past year, reserved capacity has been sufficient to support 99.9% uptime. We cannot guarantee 100% capacity availability, but over the past year more than 99% of LLM request failures were caused by enrollment and Project rate limits, issues that the reserved capacity tool can address.
There is no extra cost for reserved capacity as a service; added costs will depend on additional token usage, as with all other LLM usage in AIP. This is subject to change in the future for new use cases or specific models. If this policy changes, we will not retroactively charge existing workflows for using reserved capacity; these workflows will continue to only incur charges based on additional token usage.
Contact your Palantir administrator to request reserved capacity allocation. Once allocated, users with resource management administrator permissions can allocate reserved capacity to specific Projects.
Consider the following example to further understand reserved capacity usage:
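Suppose (hypothetical numbers) your enrollment has a shared limit of 1M TPM for GPT-4o, and a critical production pipeline in one Project is allocated 200K TPM of reserved capacity for that model. Even during a minute in which other Projects saturate the shared 1M TPM pool, the pipeline can still consume up to its reserved 200K TPM, because reserved capacity sits on top of the enrollment limit and is not subject to Project rate limits.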
Use the Analysis page to view the cost of LLM usage on your AIP-enabled enrollment. From the Analysis page, select Filter by source: All LLMs and Group by source. This will generate a chart of daily LLM cost, segmented by model.
Generally, AIP prioritizes interactive requests over pipelines with batch requests. Interactive queries are defined as any real-time interaction a user has with an LLM, such as AIP Assist, Workshop, Agent Studio, previews in the AIP Logic LLM board, and previews in the Pipeline Builder LLM node. Batch queries are defined as a large set of requests sent without a user expecting an immediate response, for example, Transforms pipelines, Pipeline Builder, and Automate (for Logic).
This principle currently guarantees that 20% of capacity at the enrollment and Project level will always be reserved for interactive queries. This means that for a 100,000 TPM capacity for a certain model, only a maximum of 80,000 TPM can be used for pipelines at any given minute, while at least 20,000 TPM (and up to 100,000 TPM) is available for interactive queries.
Consider the following example:
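A minimal sketch of that 80/20 rule, using the same hypothetical 100,000 TPM model capacity as above:

```python
# Batch pipelines may use at most 80% of a model's capacity at any given
# minute, while interactive queries may use anywhere from the guaranteed
# 20% floor up to 100% of it.
TOTAL_TPM = 100_000
INTERACTIVE_FLOOR = 0.20

batch_ceiling = int(TOTAL_TPM * (1 - INTERACTIVE_FLOOR))  # 80,000 TPM max for batch

def interactive_tpm_available(batch_usage_tpm: int) -> int:
    """Tokens per minute left for interactive queries, given current batch usage."""
    batch_usage_tpm = min(batch_usage_tpm, batch_ceiling)  # batch is capped at 80%
    return TOTAL_TPM - batch_usage_tpm

print(interactive_tpm_available(80_000))  # 20,000 TPM: the guaranteed floor
print(interactive_tpm_available(30_000))  # 70,000 TPM when batch usage is light
```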