Context Window Limits: Why 200K Tokens Isn't Really 200K
▶ Watch on YouTube & subscribe to The Stack Underflow
Everyone quotes the headline number. Claude: 200K tokens. Gemini: 1–2 million. GPT-4o: up to 1 million. The implication is that you have a massive runway before anything breaks. The reality is messier, and it affects every agent, every long coding session, and every repo-scale workflow you run.
This episode of “How Claude Actually Works” tears apart the advertised context limit, shows you what the working limit actually is, and explains why where you put information inside the window matters just as much as how much fits.
The one-sentence version: The advertised context window is not your working budget — effective capacity is roughly 60–70% of the headline, degradation hits like a cliff not a slope, and position inside the window determines whether the model actually uses what you put there.
What the context window actually contains
The context window is the model’s working memory for a single request. Everything the model can see is measured in tokens, and “everything” is a longer list than most developers expect:
| Slot | Typical size |
|---|---|
| System prompt | 2,000–5,000 tokens |
| Tool / function schemas | 1,000–4,000 tokens |
| Conversation history | Grows each turn |
| Your current message | Variable |
| Model’s output (reserved) | Up to max output tokens |
That last row is the one that surprises people. The context window is the total budget for one round trip — input and output share the same pool. The model cannot write into space that is already consumed by input. So if you hand it 190K tokens of history and ask for a 4K-token response, you are already over the line on a 200K model.
Total window budget
├── system prompt ~3K
├── tool schemas ~2K
├── conversation history grows with each turn
├── current user message variable
└── reserved for output up to max_tokens
──────────────────
200K ceiling (nominal)
The 60–70% rule: the advertised number is not the working number
Independent benchmarks — from Chroma, Nvidia, and mixed token-composition evaluations — all land on the same conclusion: models reliably perform within only 60–70% of their advertised window.
- A 200K model becomes unreliable around 130K tokens.
- A 1M token model degrades around 600–700K tokens.
And the degradation is not a gentle slope. It is a cliff. Performance stays roughly consistent up to the threshold, then drops sharply.
The relative ranking among the major models is also worth knowing:
| Model | Advertised | Effective (approx.) | Notes |
|---|---|---|---|
| Claude (Anthropic) | 200K | ~130K | Under 5% degradation across full range — most efficient |
| Gemini (Google) | 1M–2M | ~600K–1.4M | Largest window, but degrades earliest in proportion |
| GPT-4o (OpenAI) | 128K–1M | varies | Sits between the other two |
Bigger window does not equal more reliable context. Often the opposite. Gemini holds the headline record but loses coherence proportionally sooner than Claude. Claude’s 200K is smaller in absolute terms but uses it more uniformly.
The “lost in the middle” problem
Even within the working zone, not all positions are equal. Where you place information changes whether the model actually attends to it.
Research on retrieval accuracy inside the context window shows a consistent pattern across all major models:
- Start of context: ~93% recall
- End of context: ~91% recall
- Middle of context: drops to ~76%, sometimes lower
This pattern has a name: the lost in the middle problem. The model’s attention is strongest at the edges and weakest in the center.
There is an additional wrinkle based on fill level:
Window < 50% full → middle tokens get lost first
Window > 50% full → earliest tokens (the start) get lost first
This matters directly for how you structure long agentic sessions. That CLAUDE.md or project instructions file you pin at the top of your system prompt? In a long coding session, as the window fills past the halfway point, those earliest tokens start drifting out of the model’s effective attention. The anchor sinks.
What real API usage actually looks like
Here is the number that reframes the whole conversation: 78% of real production API requests use under 16,000 tokens.
Not 200K. Not 1M. 16K.
Most of the time, in most real tasks, you are nowhere near the advertised ceiling. The arms race to publish ever-larger context windows is solving a problem that the majority of actual workloads do not have.
What will bite you is not the ceiling — it is what happens in the edge cases where you approach it:
- Long multi-turn agent sessions
- Large repository ingestion (e.g., feeding a whole codebase into context)
- Workflows that accumulate tool call results turn after turn
Those are the scenarios where the 60–70% cliff and the lost-in-the-middle drop combine to make your agent behave worse the longer it runs.
Common misconceptions
-
“More context = better recall.” Not true. Performance degrades nonlinearly near the real limit, and a model with a larger window is not automatically more reliable — it can degrade earlier in proportion.
-
“The context window is just my input.” The window is the total budget for input and output in a single request. If you fill it with history, there is no room for the model’s reply.
-
“Position doesn’t matter as long as I stay under the limit.” Position matters a lot. Middle tokens are lost first in shorter sessions; earliest tokens are lost first as the window fills. Architecture your prompts accordingly.
-
“A 200K context window means the model remembers 200K tokens of conversation.” Memory and context are not the same thing. The context window is per-request working memory, not persistent storage. Nothing persists between API calls unless you explicitly include it in the next request.
Frequently asked questions
How do I know when I’m approaching the real effective limit?
Track your total token count (input + max output) per request, not just the conversation length. Most SDKs return usage metadata in the response. Set a soft alert at 60–65% of the model’s advertised window, which puts you at roughly the start of the unreliable zone.
Should I switch to a model with a larger context window for big codebases? Not automatically. A larger window only helps if the model uses it reliably. Based on current benchmarks, Claude’s 200K handles its range more uniformly than Gemini handles its million-token range. Evaluate on your actual workload, not the headline.
What is the practical fix for the lost-in-the-middle problem? Treat your context like a stack: put the most critical instructions as close to the current turn as possible (end of context), and repeat or refresh key instructions periodically rather than assuming the model retained them from the top of a long session. For agent workflows, summarize and compact history aggressively rather than letting raw turns accumulate.
Does this affect tools like GitHub Copilot or Cursor, or only direct API usage? The same physics apply to any system built on top of these models. Copilot, Cursor, and similar tools manage context windows under the hood, but they hit the same cliffs. Understanding the limits helps you debug why an AI coding tool starts producing worse suggestions deep into a large session.
Where this fits in the series
This is episode 3 of “How Claude Actually Works,” a course on the hidden mechanics that determine what AI coding tools actually cost and how they actually behave. The previous episode established that recent context dominates agent spend (62 cents of every dollar). This episode explains why that matters structurally — the window is smaller than it looks and unevenly weighted. The next episode covers context rot: why agent performance degrades the longer a session runs, even when you stay inside the working limit.
Browse all tutorials in the series for the full picture.
Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.
Subscribe on YouTube →