Claude Prompt Caching Deep Dive: Cut Input Costs by Reusing Stable Prefixes
Claude Prompt Caching Deep Dive: Cut Input Costs by Reusing Stable Prefixes
Claude is excellent at working with long context: large codebases, policy manuals, agent traces, tool schemas, retrieval bundles, and multi-step instructions. The downside is obvious to anyone running production workloads: if you resend the same 80,000-token prefix on every request, you pay for those input tokens again and again.
Prompt caching solves that problem.
Instead of charging full input price every time you send a stable prompt prefix, Claude can cache that prefix and let later requests reuse it at a much lower “cache read” price. Used well, prompt caching can dramatically reduce costs for coding agents, document chat, customer support copilots, research workflows, and any system where most of the prompt stays the same while the final user query changes.
This deep dive explains how Claude prompt caching works, what counts as a cache write versus a cache read, what breaks cache hits, and how to structure prompts for maximum savings.
What Claude Prompt Caching Actually Does
Prompt caching lets you mark parts of a request as reusable. Claude stores a tokenized representation of a stable prompt prefix. On later requests, if the beginning of the request matches a previously cached prefix, Claude can reuse it instead of processing those tokens from scratch.
The key phrase is stable prefix.
Prompt caching is not a semantic cache. Claude is not saying, “This looks similar enough.” It is matching the beginning of the request exactly enough to safely reuse the cached computation. If the reusable part changes, even slightly, you may get a cache miss.
Typical cacheable content includes:
- Long system prompts
- Tool definitions
- Static application instructions
- Product documentation
- Codebase context
- Style guides
- Legal or compliance text
- Few-shot examples
- Long-running agent memory snapshots
- Retrieval chunks that are reused across multiple turns
Typical non-cacheable or less useful content includes:
- The user’s latest question
- A changing timestamp
- Request IDs
- Randomized metadata
- Dynamic retrieved snippets that differ every call
- Conversation turns that are constantly appended before the cache boundary
The design goal is simple: put the expensive, stable stuff first; put the small, changing stuff last.
Cache Writes vs Cache Reads
Claude prompt caching has two relevant billing categories:
| Billing category | What it means | Typical cost behavior |
|---|---|---|
| Cache write | Claude processes and stores a cacheable prefix for future reuse | More expensive than normal input tokens |
| Cache read | Claude reuses a previously cached prefix | Much cheaper than normal input tokens |
| Normal input | Tokens not cached or not matching a cache | Standard input price |
| Output | Claude’s generated response | Standard output price |
The first time you send a cacheable prefix, you pay a cache write price for those tokens. On subsequent matching requests within the cache’s time-to-live, you pay the cheaper cache read price.
Exact pricing varies by model and provider, so always check the current rate card. In Anthropic-style pricing, cache writes are usually priced at a premium over normal input, while cache reads are a small fraction of normal input cost. That means caching is most valuable when a prefix is reused multiple times.
If you are accessing Claude through a gateway such as AI Prime Tech, which resells cheaper Claude API access alongside GPT-5.5 and Gemini 3, the same principle applies: cache writes cost more than reads, and the savings compound when you reuse stable prefixes often.
TTL: The Cache Is Temporary
Prompt caches are not permanent storage. They have a TTL, or time-to-live.
Claude commonly supports short-lived caching, often around five minutes by default, with longer TTL options available for some configurations. The exact TTL support can vary by model, API version, and provider.
This has important architectural consequences:
- If users ask several follow-up questions within a few minutes, cache hit rates can be excellent.
- If the same document is used once every few hours, a short TTL may not help much.
- If a background agent loops through many tool calls quickly, caching can save a lot.
- If traffic is sporadic, you may need to batch, prewarm, or accept occasional cache writes.
Think of prompt caching as a hot working-set optimization, not a replacement for a vector database, object store, or long-term memory system.
What Breaks a Cache Hit?
Cache hits are fragile by design. Claude can only reuse a cache when the request prefix matches the cached prefix. The most common cache breakers are surprisingly mundane.
1. Changing Content Before the Cache Boundary
If you put a timestamp near the top of the system prompt, every request becomes unique:
Current time: 2026-06-11T10:31:02Z
You are a helpful assistant...
That timestamp changes every call, so the prefix changes every call.
Better:
You are a helpful assistant...
[large stable instructions]
Then put dynamic values later:
Current time: 2026-06-11T10:31:02Z
User question: ...
2. Reordering Tools or Instructions
Tool definitions are often large, especially for agents. Reordering tools, changing JSON schema formatting, or injecting dynamic descriptions can invalidate a cache.
Keep tool definitions:
- Deterministically ordered
- Minified or consistently formatted
- Versioned explicitly
- Free of per-request metadata
3. Appending Conversation History Before Cached Content
A common mistake is building prompts like this:
System instructions
Conversation history
Large documentation bundle
Latest user message
The conversation history changes every turn, so the documentation bundle may no longer be part of a stable prefix.
A better structure:
System instructions
Large documentation bundle
Few-shot examples
Conversation history
Latest user message
Now the large reusable part is before the dynamic part.
4. Tiny Formatting Differences
Whitespace, serialization differences, changed key order in JSON, newline normalization, or template changes can all cause misses.
Use stable renderers:
- Deterministic JSON serialization
- Fixed section ordering
- Stable markdown templates
- No random IDs in cached sections
- No “generated at” text in cacheable blocks
5. Switching Models
Caches are generally model-specific. A cache created for Claude Sonnet 4.6 should not be assumed to apply to Claude Opus 4.8, Haiku 4.5, or Fable 5. If you route requests dynamically across models, expect separate caches.
That does not mean you cannot use multiple models. It just means each model should have its own caching strategy.
Structuring Prompts for Maximum Hit Rate
The highest-leverage prompt caching trick is to design your prompt as a layered prefix.
Recommended Order
1. Tool definitions
2. System/developer instructions
3. Stable policy, documentation, code, or examples
4. Semi-stable context
5. Conversation history
6. Latest user input
7. Per-request metadata
The exact API representation depends on how you call Claude, but the conceptual order is what matters. Cache the largest stable prefix you can, and keep volatile content after it.
Use Versioned Stable Blocks
For long-lived applications, version your cacheable blocks:
<app_instructions version="2026-06-01">
...
</app_instructions>
<tool_contracts version="billing-tools-v17">
...
</tool_contracts>
<support_policy version="refund-policy-v9">
...
</support_policy>
This makes cache invalidation intentional. When the refund policy changes, the version changes. Until then, every request uses the same stable text.
Separate Stable Retrieval from Dynamic Retrieval
In retrieval-augmented generation, not all retrieved content is equally dynamic.
For example, a coding assistant may always include:
- Repository architecture overview
- Public API docs
- Style guide
- Testing conventions
Then it dynamically retrieves files relevant to the latest task.
Put the stable repository context in the cached prefix. Put task-specific retrieved snippets later. If a user is working in one area for several turns, you may also cache a semi-stable “working set” of files.
Cost Math: When Does Caching Pay Off?
Let’s use simple numbers. Suppose normal input costs 1 unit per token. Cache writes cost 1.25 units, and cache reads cost 0.10 units. These are illustrative ratios; check your actual model pricing.
Assume you have:
- 100,000-token stable prefix
- 2,000-token dynamic user/task section
- 1,000-token output
- 10 requests in the same cache window
Without caching, input cost is:
10 × (100,000 + 2,000) = 1,020,000 input-token units
With caching:
First request:
100,000 × 1.25 = 125,000 cache-write units
2,000 × 1.00 = 2,000 normal input units
Next 9 requests:
9 × 100,000 × 0.10 = 90,000 cache-read units
9 × 2,000 × 1.00 = 18,000 normal input units
Total input-equivalent units:
125,000 + 2,000 + 90,000 + 18,000 = 235,000
That is roughly a 77% input-side reduction in this simplified example.
The break-even point comes quickly when the cached prefix is large. With the ratios above, one write plus one read costs:
1.25 + 0.10 = 1.35
Two uncached sends would cost:
1.00 + 1.00 = 2.00
So even the second use can be profitable. The more requests hit the same cache within the TTL, the better the economics.
Combining Prompt Caching with Long Agents
Long-running agents are one of the best fits for Claude prompt caching.
Modern agents often include:
- Large tool schemas
- Planning instructions
- Safety rules
- Product documentation
- Codebase maps
- Prior task summaries
- Execution traces
- Intermediate observations
If every tool call resends all of that, costs balloon. With caching, you can keep the stable agent substrate hot while each step adds only the latest observation or instruction.
For example:
Cached prefix:
- Tool definitions
- Agent operating rules
- Repo map
- Coding conventions
- Test instructions
- Current task plan
Dynamic suffix:
- Latest tool result
- Next requested action
This is especially valuable with long-context models like Claude Fable 5 with 1M context, where the temptation is to include everything. Long context gives the model room to reason over large inputs; prompt caching makes repeated use of that context economically viable.
A practical pattern for agents:
- Start with a large cached base prompt.
- Keep frequently reused context before the cache boundary.
- Summarize old volatile turns into a stable task memory.
- Cache the updated task memory when it will be reused.
- Put the newest tool result and next instruction at the end.
This avoids sending a constantly growing conversation as an uncached blob.
Operational Tips for Production
Track Cache Metrics
You should log:
- Cache creation/write tokens
- Cache read tokens
- Normal input tokens
- Output tokens
- Cache hit rate by route
- Cache hit rate by model
- Prefix size
- TTL expiry patterns
If your hit rate is low, inspect the rendered prompts. Usually something dynamic has crept into the prefix.
Prewarm When It Makes Sense
For high-traffic applications, you can deliberately create a cache before users need it. For example, when a customer opens a large document workspace, prewarm the cache with the document and instructions. Follow-up questions can then hit the cache.
Do this carefully: prewarming costs money, so it only pays off when reuse is likely.
Use Model-Specific Strategies
Claude Opus 4.8, Sonnet 4.6, Haiku 4.5, and Fable 5 have different cost/performance profiles. You may want:
- Opus for hardest reasoning over cached expert context
- Sonnet for balanced agent workloads
- Haiku for fast, cheaper interactions
- Fable for very large context windows
- GPT-5.5 or Gemini 3 for fallback or comparative routing
If you use AI Prime Tech as a third-party gateway for cheaper Claude API access, model routing and cost monitoring become especially important. Prompt caching should be part of that routing strategy, not an afterthought.
Common Anti-Patterns
Avoid these:
- Putting timestamps at the top of the prompt
- Randomizing tool order
- Injecting request IDs into system instructions
- Re-rendering JSON with nondeterministic key order
- Placing conversation history before stable docs
- Caching tiny prefixes with low reuse
- Assuming caches last forever
- Switching models and expecting the same cache to hit
- Treating prompt caching as semantic similarity caching
Prompt caching is powerful, but it rewards discipline.
Final Checklist
Before shipping Claude prompt caching, ask:
- Is my largest stable content at the beginning of the request?
- Are dynamic values after the cached prefix?
- Are tools and JSON schemas rendered deterministically?
- Do I understand the TTL?
- Is reuse likely within that TTL?
- Am I tracking cache writes, reads, and misses?
- Have I calculated break-even for my actual model prices?
- Does my agent summarize volatile history into reusable stable memory?
If the answer is yes, prompt caching can be one of the easiest ways to cut Claude input costs without reducing context quality. For long-context workflows and agents, it often turns “too expensive to run repeatedly” into “cheap enough to use continuously.”
One API key for Claude Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5, plus GPT & Gemini — up to 80% off official pricing, pay-as-you-go.
Get Your API Key →