# Prompt Caching

Prompt Caching allows you to reduce costs and latency by reusing previously processed prompt tokens. When supported by the provider, repeated parts of your input are served from cache instead of being reprocessed.

Caching is handled transparently at the provider level. The router automatically detects cache usage in API responses and applies the correct billing.

💡 **Note:** Not all providers support prompt caching. Behavior and pricing depend on the underlying provider.

### ⚙️ How It Works

When you send a request with a large or repeated prompt:

1. The provider may cache part of the input (typically the prefix)
2. On subsequent requests with the same prefix, cached tokens are reused
3. Cached tokens are billed at a reduced rate (or sometimes free)

The router detects caching via the `prompt_tokens_details` field and splits input tokens into three categories:

| Token Type            | Description                          | Billing                 |
| --------------------- | ------------------------------------ | ----------------------- |
| Text tokens           | Fresh input tokens (not cached)      | Standard input rate     |
| Cached tokens         | Tokens served from cache (cache hit) | Reduced cache-read rate |
| Cache creation tokens | Tokens written to cache              | Cache-write rate        |

#### Cost Calculation

```
input_cost = text_tokens × input_rate
           + cached_tokens × cache_read_rate
           + cache_creation_tokens × cache_write_rate
```
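The formula can be sketched in Python as follows. This is an illustration only: the function name and the rates are placeholders rather than real pricing, and it assumes that text tokens equal `prompt_tokens` minus the two cache counters, since `prompt_tokens` includes all token types.

```python
def input_cost(prompt_tokens, cached_tokens, cache_creation_tokens,
               input_rate, cache_read_rate, cache_write_rate):
    """Apply the three-way token split to one request's input tokens."""
    # Assumption: prompt_tokens already includes cached and cache-creation tokens.
    text_tokens = prompt_tokens - cached_tokens - cache_creation_tokens
    return (text_tokens * input_rate
            + cached_tokens * cache_read_rate
            + cache_creation_tokens * cache_write_rate)


# Placeholder per-token rates, chosen only to make the arithmetic easy to follow.
cost = input_cost(
    prompt_tokens=15000, cached_tokens=10000, cache_creation_tokens=0,
    input_rate=2.0, cache_read_rate=0.5, cache_write_rate=2.5,
)
print(cost)  # 5000 * 2.0 + 10000 * 0.5 + 0 * 2.5 = 15000.0
```

With these placeholder rates, the same 15,000 tokens would cost 30000.0 without any cache hit, so the cached prefix halves the input cost in this example.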

### 🔍 Cache Usage Fields

Each API response includes a `usage` object. When caching is active, you’ll see:

```json
{
  "usage": {
    "completion_tokens": 3000,
    "prompt_tokens": 15000,
    "total_tokens": 18000,
    "prompt_tokens_details": {
      "cached_tokens": 10000,
      "cache_creation_tokens": 0
    },
    "cost_details": {
      "prompt_cost": 100,
      "cache_read_cost": 150,
      "cache_write_cost": 0,
      "completion_cost": 80
    }
  }
}
```

| Field                   | Description                             |
| ----------------------- | --------------------------------------- |
| `prompt_tokens`         | Total input tokens (includes all types) |
| `cached_tokens`         | Tokens served from cache (cache hit)    |
| `cache_creation_tokens` | Tokens written to cache                 |

When `cached_tokens > 0`, caching is active and cost savings apply.
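A minimal sketch of reading these fields from a parsed response is shown below. The helper name `cache_summary` is hypothetical, and the code assumes that `prompt_tokens_details` may be missing when a provider does not report caching.

```python
def cache_summary(usage):
    """Summarize cache activity from a response's `usage` object (a dict)."""
    # Assumption: the details object can be absent or null for some providers.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    created = details.get("cache_creation_tokens", 0)
    prompt = usage.get("prompt_tokens", 0)
    return {
        "cache_hit": cached > 0,
        "cached_tokens": cached,
        "cache_creation_tokens": created,
        "text_tokens": prompt - cached - created,
        "hit_ratio": cached / prompt if prompt else 0.0,
    }


# Token counts taken from the example response above.
usage = {
    "completion_tokens": 3000,
    "prompt_tokens": 15000,
    "total_tokens": 18000,
    "prompt_tokens_details": {"cached_tokens": 10000, "cache_creation_tokens": 0},
}
summary = cache_summary(usage)
print(summary["cache_hit"], summary["text_tokens"])  # True 5000
```

Logging `hit_ratio` per request is one simple way to see whether a prompt layout is actually producing cache hits.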

### 🏢 Supported Providers

#### Amazon Bedrock (Claude models)

Supports both implicit and explicit caching.

You can explicitly control caching using `cache_control`:

```python
# Request parameters for a chat completion call (pass them to your client).
request = {
    "model": "claude-4-6-sonnet",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert analyst. Use the document below to answer my question.",
                },
                {
                    "type": "text",
                    "text": "LONG DOCUMENT: ... (thousands of tokens)",
                    # Cache breakpoint: content up to and including this block is eligible for caching.
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": "Question: What are the 3 main insights from this document?",
                },
            ],
        }
    ],
}
```

* Minimum tokens: \~1024 (varies by model)
* Prompts below the minimum are not cached
* **TTL:** \~5–60 minutes
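To reuse the same cached document across several questions, the content blocks can be built by a small helper so that only the final block changes between requests. The sketch below assumes the `cache_control` layout shown above; `build_cached_content` is a hypothetical helper name, not part of any SDK.

```python
def build_cached_content(instructions, document, question):
    """Return content blocks with a cache marker on the long, reusable document."""
    return [
        {"type": "text", "text": instructions},
        {
            "type": "text",
            "text": document,
            # The marker sits on the last block of the stable prefix.
            "cache_control": {"type": "ephemeral"},
        },
        # The question comes after the marker, so changing it leaves the prefix intact.
        {"type": "text", "text": question},
    ]


instructions = "You are an expert analyst. Use the document below to answer my question."
document = "LONG DOCUMENT: ... (thousands of tokens)"

first = build_cached_content(instructions, document, "Question: What are the 3 main insights?")
second = build_cached_content(instructions, document, "Question: Who is the intended audience?")

# The first two blocks are identical across both requests; only the question differs.
print(first[:2] == second[:2])  # True
```

Whether the second request actually hits the cache still depends on the minimum token count and TTL listed above.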

#### Google Vertex (Gemini models)

Supports **automatic (implicit) caching:** no configuration required.

* Works with Gemini 2.5 Flash and Pro
* Minimum tokens:
  * Flash: \~1024
  * Pro: \~2048
* **TTL:** \~3–5 minutes

#### Azure (OpenAI models)

* Automatic caching (cannot be disabled)
* Applies to GPT-4o and newer models
* Requirements:
  * Minimum \~1024 tokens
  * First \~1024 tokens must be identical for cache hits
* **TTL:** \~5–60 minutes

#### Mistral

* Automatic caching supported
* Cache read pricing is typically \~10% of the standard input cost

#### Inceptron

* Automatic caching supported
* Cache write may be free

#### Tensorix

* Automatic caching supported
* Cache reuse is **best-effort, not guaranteed** (cache hits depend on warm replica state and request routing across replicas)
* No explicit cache control or TTL exposed to users

#### Other Providers

Prompt caching is **not currently supported**.

### 💡 Best Practices for Cache Usage

* **Keep prompts stable:** Ensure the beginning of your prompt remains identical across requests
* **Use large shared context:** Place reusable content (documents, instructions) at the start
* **Avoid dynamic prefixes:** Changes at the beginning of the prompt prevent cache reuse
* **Leverage system messages:** Store static instructions or context there
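The practices above can be illustrated with a small sketch that keeps static content first and dynamic content last, then counts how many leading messages two requests share. The message layout and the helper names (`build_messages`, `shared_prefix_len`) are assumptions for illustration. Real cache matching happens on tokens at the provider, so this is only a rough proxy.

```python
STATIC_INSTRUCTIONS = "You are an expert analyst. Answer using only the provided document."


def build_messages(document, question):
    """Static instructions and shared context first, the changing question last."""
    return [
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        {"role": "user", "content": f"Document:\n{document}"},
        {"role": "user", "content": f"Question: {question}"},
    ]


def shared_prefix_len(a, b):
    """Count how many leading messages are identical in both requests."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


doc = "LONG DOCUMENT: ... (thousands of tokens)"
first = build_messages(doc, "What are the main insights?")
second = build_messages(doc, "Who is the intended audience?")
print(shared_prefix_len(first, second))  # 2

# Anti-pattern: a dynamic value at the very start changes the prefix on every request.
dynamic = [{"role": "system", "content": "Request 42. " + STATIC_INSTRUCTIONS}] + first[1:]
print(shared_prefix_len(first, dynamic))  # 0
```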

