Prompt Caching: How Anthropic and OpenAI Differ (and the Catch)

June 23, 2026 · The Hidden Cost of AI Coding (part 5)

▶ Watch on YouTube & subscribe to The Stack Underflow

In a multi-step AI agent, the model has no memory between calls. Every step has to resend system instructions, tool schemas, and the full conversation history — only the newest user message is actually new. Episode 2 of this series put a number on the waste: 62% of a typical agent’s input bill is resent context. Prompt caching is the direct fix: pay to process a stable block of tokens once, then reuse that work on every subsequent call at a fraction of the cost.

Most teams never configure it, leaving 25–50% of their input bill untouched. This tutorial walks through how caching works on both Anthropic and OpenAI, what the break-even math looks like, and the single most common mistake that silently destroys your hit rate.

The one-sentence version: Prompt caching lets you pay once to process a stable, repeated prefix and reuse that work on every subsequent call — but only if your prompts are structured so the stable content comes first.

Why the same tokens keep getting billed

Consider a three-step agent run:

Step 1:  [System prompt] + [Tool schemas] + [User message 1]
Step 2:  [System prompt] + [Tool schemas] + [History so far] + [User message 2]
Step 3:  [System prompt] + [Tool schemas] + [History so far] + [User message 3]

Steps 2 and 3 resend almost everything from step 1. Without caching, you pay full price to re-tokenize and re-process that identical block each time. Three steps, three identical bills for the same content.

Prompt caching breaks that pattern. The first time the model sees the big stable block, it does the work and saves the result (the KV cache). Every subsequent request that begins with the same token sequence reuses the saved result instead of recomputing it. Same answer quality, dramatically less compute, lower latency.

Two flavors: Anthropic vs. OpenAI

The two major providers implement caching very differently in production.

FeatureAnthropicOpenAI
How it’s enabledExplicit — you add cache_control markersAutomatic — any prefix ≥ 1,024 tokens
Read discount90% off (pay 0.1x)50% off
Write cost1.25x normal input priceNo extra charge
Default TTL5 minutes (1-hour option available at higher cost)~1 hour (varies by model)
VisibilityCache write/read tokens in API responseCache tokens in usage object

Anthropic: explicit caching with cache_control

You mark the end of the stable prefix with a cache_control header on the relevant message or content block. Anthropic then knows exactly where to split the cacheable portion from the dynamic tail.

The economics require some thought. A cache write costs 1.25x the normal input price. A cache read costs 0.1x. The break-even point is roughly 1.5 reads per write — if you write a cache entry and read it only once, you actually pay more than if you had skipped caching entirely. Cache a prompt, use it twice, and you’re ahead.

The default TTL is 5 minutes. An extended 1-hour TTL is available at a higher write rate. If your agent runs are spaced further apart than the TTL, your hit rate collapses and you pay the write premium with none of the read savings. Always plan around your real call frequency.

OpenAI: automatic caching

OpenAI caches transparently. Any prefix longer than 1,024 tokens is eligible — no markers needed. Reads cost 50% off; there is no extra write charge. The trade-off is less control: you can’t target exactly what gets cached, and you can’t force a cache flush.

The 50% discount is smaller than Anthropic’s 90%, but the zero write cost and automatic operation make it simpler to capture savings without restructuring prompts.

The one mistake that kills your hit rate

Caching only works if the prefix is byte-for-byte identical on every call. Change even a single token at the start of the stable block and the cache misses — even if the next 10,000 tokens are identical.

The classic trap: embedding a timestamp, request ID, or user identifier near the top of the system prompt.

# BAD — timestamp in the prefix kills cache hits
system_prompt = f"""
Current time: {datetime.utcnow().isoformat()}
You are a helpful assistant...
[10,000 more tokens of stable content]
"""

Every call generates a unique prefix. Zero cache hits. You pay the write cost every time with no reads to offset it.

The fix is a simple structural rule:

# GOOD — stable content first, variable content last
1. System prompt (static instructions)
2. Tool schemas (rarely change)
3. Conversation history (grows but is stable for this call)
4. Current user message (only truly new thing)

Put anything dynamic — timestamps, user IDs, session state — at the very end. The prefix the model reads first must be identical across calls.

Real numbers

A 10,000-token prefix on Claude Sonnet 3.5 at uncached price costs roughly $0.03 per call.

ScenarioCost
10 calls, no caching$0.30
10 calls with Anthropic caching (1 write + 9 reads)~$0.0375 + (9 × $0.003) = ~$0.065
Savings~78% cheaper

At scale, connecting back to episode 2’s 62% figure: prompt caching done well can knock 30–50% off your total input bill with zero change in output quality. On a $1,000/month spend, that’s $300–$500 back for one afternoon of prompt restructuring.

Three things that make caching pay off

  1. Stable prefix at the top. Variable tokens near the start destroy hit rates.
  2. Enough reads per cache write. For Anthropic, you need at least 1.5 reads per write to break even. More is better.
  3. Measurement. Check the cache_read_input_tokens and cache_creation_input_tokens fields in API responses. If your hit rate is low, audit your prompt order before assuming caching is broken.

Common misconceptions

  • “Caching changes the model’s output.” It does not. The KV cache stores intermediate computation results, not final answers. Output quality is identical to a fully uncached call.
  • “Caching is automatic on Anthropic.” Only on OpenAI. Anthropic requires explicit cache_control markers. If you don’t add them, nothing is cached.
  • “The cache persists indefinitely.” Anthropic’s default TTL is 5 minutes. If your use pattern has long gaps between calls, you will pay the write premium repeatedly without accumulating reads.
  • “As long as the content is the same, it will cache.” Token order matters too. Identical text structured in a different order produces different tokens and misses the cache.

Frequently asked questions

Does prompt caching work across different users or sessions? No. Cache entries are scoped to your API key, not shared globally. But within your system, any call from any user that starts with the same stable prefix will hit the same cache entry — which is precisely why keeping the system prompt and tool schemas stable across users multiplies the savings.

What happens if I update my system prompt mid-deployment? The cache entry is invalidated and must be written again at 1.25x cost. For Anthropic, design your system prompt to be stable for at least a session (ideally the full TTL window). For OpenAI, updates propagate automatically with no extra penalty beyond the first uncached call.

Should I use Anthropic or OpenAI caching? It depends on your call pattern. Anthropic’s 90% read discount wins at high call frequency; OpenAI’s zero write cost wins for infrequent or unpredictable workloads where you might not accumulate enough reads to break even. Run the math against your actual call distribution.

How do I know if caching is actually working? Anthropic returns cache_creation_input_tokens (tokens written) and cache_read_input_tokens (tokens served from cache) in every API response. OpenAI surfaces similar fields in the usage object. If cache_read_input_tokens is zero across repeated calls with the same prefix, your prefix is not stable.

Where this fits in the series

This tutorial is part of How Claude Actually Works, a course on The Stack Underflow that dissects the real mechanics and costs behind building with AI models. Episode 2 established that resent context is the dominant cost driver; this episode shows the primary remedy. Episode 6 continues with model tiering — why routing tasks to cheaper, smaller models is the next lever to pull after caching. Browse all tutorials to follow the full series.

Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.

Subscribe on YouTube →