Prompt Caching

Prompt Caching allows you to reduce costs and latency by reusing previously processed prompt tokens. When supported by the provider, repeated parts of your input are served from cache instead of being reprocessed.

Caching is handled transparently at the provider level, the router automatically detects cache usage in API responses and applies the correct billing.

💡 Note: Not all providers support prompt caching. Behavior and pricing depend on the underlying provider.

⚙️ How It Works

When you send a request with a large or repeated prompt:

  1. The provider may cache part of the input (typically the prefix)

  2. On subsequent requests with the same prefix, cached tokens are reused

  3. Cached tokens are billed at a reduced rate (or sometimes free)

The router detects caching via the prompt_tokens_details field and splits input tokens into three categories:

Token Type
Description
Billing

Text tokens

Fresh input tokens (not cached)

Standard input rate

Cached tokens

Tokens served from cache (cache hit)

Reduced cache-read rate

Cache write tokens

Tokens written to cache

Cache-write rate

Cost Calculation

input_cost = text_tokens × input_rate
           + cached_tokens × cache_read_rate
           + cache_write_tokens × cache_write_rate

🔍 Cache usage fields

Each API response includes a usage object. When caching is active, you’ll see:

Field
Description

prompt_tokens

Total input tokens (includes all types)

cached_tokens

Tokens served from cache (cache hit)

cache_write_tokens

Tokens written to cache

When cached_tokens > 0, caching is active and cost savings apply.

🏢 Supported Providers

Amazon Bedrock (Claude models)

Supports both implicit and explicit caching.

You can explicitly control caching using cache_control:

  • Minimum tokens: ~1024 (varies by model)

  • Prompts below the minimum are not cached

  • TTL: ~5–60minutes

Google Vertex (Gemini models)

Supports automatic (implicit) caching: no configuration required.

  • Works with Gemini 2.5 Flash and Pro

  • Minimum tokens:

    • Flash: ~1024

    • Pro: ~2048

  • TTL: ~3–5 minutes

Azure (OpenAI models)

  • Automatic caching (cannot be disabled)

  • Applies to GPT-4o and newer models

  • Requirements:

    • Minimum ~1024 tokens

    • First ~1024 tokens must be identical for cache hits

  • TTL: ~5–60minutes

Mistral

  • Automatic caching supported

  • Cache read pricing is typically ~10% of the standard input cost

Inceptron

  • Automatic caching supported

  • Cache write may be free

Other Providers

Prompt caching is not currently supported.

💡 Best practices for cache usage

  • Keep prompts stable: Ensure the beginning of your prompt remains identical across requests

  • Use large shared context: Place reusable content (documents, instructions) at the start

  • Avoid dynamic prefixes: Changes at the beginning of the prompt prevent cache reuse

  • Leverage system messages: Store static instructions or context there

Last updated