Prompt Cache

When developing AI applications, you may find yourself reusing the same prompts across multiple requests. The Prompt Cache feature can help improve response times and reduce API call costs by caching your frequently used prompts.

Restrictions and Limitations

Cache Limits

  • Models have a minimum cache-eligible prompt length; attempting to cache a shorter prompt will result in an API error (a simple length pre-check is sketched below).
  • Cached prompts have a lifetime (TTL) of approximately 5 minutes. This duration cannot be modified due to provider limitations.
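Because prompts below the minimum length cannot be cached, it can help to gate the cache_control flag on a rough length check before sending the request. The sketch below is illustrative only: the 1024-token threshold corresponds to Claude 3.5 Sonnet in the table in the next section, and the characters-per-token ratio is an assumption, not an exact count.

# Minimal pre-check before requesting caching. The chars/4 token estimate and
# the 1024-token threshold (Claude 3.5 Sonnet) are assumptions; use your
# model's exact minimum from the table below.
MIN_CACHE_TOKENS = 1024

def build_content_block(text: str) -> dict:
    """Return a text content block, adding cache_control only when the text
    plausibly exceeds the model's minimum cache-eligible length."""
    block = {"type": "text", "text": text}
    if len(text) / 4 >= MIN_CACHE_TOKENS:  # crude token estimate
        block["cache_control"] = {"type": "ephemeral"}
    return block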

Supported Models

Prompt Cache is currently in beta and available only for the models listed below:

| Model | Minimum Cache Length (tokens) | Base Input Tokens | Cache Writes | Cache Hits | Output Tokens |
|---|---|---|---|---|---|
| DeepSeek Coder | / | $0.14 / MTok | $0 / MTok | $0.02 / MTok | $0.28 / MTok |
| DeepSeek Chat | / | $0.14 / MTok | $0 / MTok | $0.02 / MTok | $0.28 / MTok |
| Claude 3.5 Sonnet | 1024 | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
| Claude 3.0 Haiku | 2048 | $0.25 / MTok | $0.30 / MTok | $0.03 / MTok | $1.25 / MTok |
| Claude 3.0 Opus | 1024 | $15 / MTok | $18.75 / MTok | $1.50 / MTok | $75 / MTok |
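To gauge whether the higher cache-write rate pays off, compare it against repeatedly sending the same input uncached. Below is a rough calculation using the Claude 3.5 Sonnet prices from the table above; the prefix size and request count are illustrative assumptions.

# Worked cost example with the Claude 3.5 Sonnet prices above (per MTok).
# The prefix size and request count are made-up numbers for illustration.
BASE_INPUT = 3.00 / 1_000_000    # $ per token, uncached input
CACHE_WRITE = 3.75 / 1_000_000   # $ per token, first (cache-writing) request
CACHE_HIT = 0.30 / 1_000_000     # $ per token, subsequent cache hits

prefix_tokens = 10_000   # shared prompt prefix that gets cached
requests = 50            # identical requests within the cache TTL

without_cache = requests * prefix_tokens * BASE_INPUT
with_cache = prefix_tokens * CACHE_WRITE + (requests - 1) * prefix_tokens * CACHE_HIT

print(f"without cache: ${without_cache:.2f}")  # $1.50
print(f"with cache:    ${with_cache:.2f}")     # ~$0.18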

Implementation

To cache a prompt, add a cache_control object to a content block in your request payload. Currently, the only supported cache type is ephemeral. Once a prompt is cached, subsequent identical requests use the cached prompt, reducing response time and API call costs.

You can keep the cache_control object in subsequent requests without forcing a fresh cache write; as long as the cached prompt is still valid, those requests read directly from the cache.

Example:

{
  "model": "claude-3-5-sonnet",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}
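For context, the sketch below sends this payload twice so that the second call can be served from the cache within the TTL. The endpoint URL, header names, and API key are placeholders rather than this API's documented values; consult the API reference for the exact request format.

# Minimal sketch of issuing the request above twice; the second, identical call
# can hit the cache. URL, headers, and key are placeholder assumptions.
import requests

API_URL = "https://api.example.com/v1/messages"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}

payload = {
    "model": "claude-3-5-sonnet",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing literary works. "
                            "Your goal is to provide insightful commentary on themes, "
                            "characters, and writing style.",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        }
    ],
}

# First call writes the cache; an identical call within the TTL reads from it.
first = requests.post(API_URL, headers=HEADERS, json=payload).json()
second = requests.post(API_URL, headers=HEADERS, json=payload).json()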

Monitoring

You can track cache performance in the Analytics dashboard. The dashboard displays metrics for Cache Creation Input Tokens and Cache Read Input Tokens, allowing you to assess the effectiveness of your prompt caching strategy.
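You can also inspect cache usage per response in code. The sketch below assumes the response body exposes usage counters named after the dashboard metrics; verify the exact keys in the API reference.

# Sketch: checking cache effectiveness per response rather than in the dashboard.
# The usage field names are assumptions mirroring the dashboard metric names.
def summarize_cache_usage(response: dict) -> str:
    usage = response.get("usage", {})
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    return f"cache writes: {written} tokens, cache reads: {read} tokens"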