What Is Context Rot and Why AI Agents Degrade Over Time
You have probably felt it: start an agent session and the first ten minutes are sharp. Twenty minutes in, still solid. Thirty minutes — small mistakes creep in. Forty minutes in, the agent edits the file you explicitly told it not to touch. You wonder if the model ran out of context. It didn’t. The context window is barely a third full. What you are experiencing is context rot.
Context rot is not a vibe or an unlucky streak. Chroma Research (2025) tested 18 frontier models, including GPT-4.1, Claude Opus 4, and Gemini 2.5, and found the degradation to be repeatable and measurable. Performance degrades long before the window fills, sometimes at just 30% fill. More tokens in, worse output out, regardless of remaining capacity.
The one-sentence version: Context rot is model performance degrading as the conversation grows longer — caused by the size of context, not the fullness of the window.
The core misunderstanding: size vs. fullness
Almost everyone’s first mental model is wrong: “The context window fills up, the model drops old instructions, performance tanks.” Logical, but that is not what the research shows.
The relevant variable is total tokens in the context, not percentage of window used. A 10,000-token conversation inside a 1M-token window can rot just as badly as that same conversation inside a 32K window. The mechanism is internal to how the model processes tokens, not an external hard limit.
Naive model (WRONG):
[============================░░░░░░░░░░░░░░░░░░░░░░░] ← performance fine here
[====================================================] ← window full, now it breaks
Reality (RIGHT):
[====░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] ← rot begins here (~30% fill)
[====================================================] ← already well past the cliff
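To make the size-versus-fullness distinction concrete, here is a minimal sketch that computes how full two windows are for the same conversation. The helper name `fill_fraction` is hypothetical, and 32,000 is used as a round stand-in for a "32K" window. The point is only that the token count stays fixed while the fill percentage changes with the window.

```python
def fill_fraction(context_tokens: int, window_tokens: int) -> float:
    """Fraction of the window occupied by the current context."""
    return context_tokens / window_tokens


# The same 10,000-token conversation placed in two different windows.
context_tokens = 10_000

for window_tokens in (32_000, 1_000_000):  # 32_000 is an assumed round value for "32K"
    pct = 100 * fill_fraction(context_tokens, window_tokens)
    print(f"{window_tokens:>9,}-token window: {context_tokens:,} tokens = {pct:.2f}% full")
```

The fill percentages differ by a factor of about thirty, yet the model is processing the same 10,000 tokens in both cases. If rot tracks the token count, the fill percentage alone tells you little.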
Three compounding mechanisms
Context rot is not a single failure mode. It is three mechanisms that compound on each other.
1. Lost in the middle
As covered in the previous episode: when a long context is processed by a transformer, content in the middle of the window receives systematically less attention than content at the very beginning and the very end. Instructions you gave 15 minutes ago drift into what researchers call the “dim zone” — still technically present, but receiving a thinner slice of model focus.
2. Attention dilution
Transformer self-attention scales quadratically with context length: every token attends to every other token. At 100,000 tokens, that is roughly 10 billion pairwise relationships (100,000 × 100,000). Each individual relationship gets a smaller share of the model’s representational budget, so the more tokens you add, the less attention any single token receives. Instructions, constraints, and critical context all thin out together.
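The arithmetic and the dilution effect can be sketched with a toy model. This is an illustration under strong assumptions, not how any production model scores tokens: it uses a single softmax over one "instruction" token whose score is higher than all the others by a fixed `boost`, while every other token gets an identical score. The function names are hypothetical.

```python
import math


def pairwise_count(n_tokens: int) -> int:
    """Ordered token pairs under full (unmasked) self-attention."""
    return n_tokens * n_tokens


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def instruction_weight(n_tokens: int, boost: float = 2.0) -> float:
    """Softmax weight on one token scored `boost` above n_tokens - 1 identical others.

    Toy assumption: real models have many heads, layers, and learned scores.
    """
    scores = [boost] + [0.0] * (n_tokens - 1)
    return softmax(scores)[0]


print(f"pairs at 100,000 tokens: {pairwise_count(100_000):,}")
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> weight on the instruction token: {instruction_weight(n):.6f}")
```

Even with a fixed advantage, the instruction token's share of attention shrinks as the context grows, because the softmax has to distribute a total of 1.0 over more entries. With causal masking the pair count is roughly half the figure above, but the growth is still quadratic.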
3. Distractor interference
This is the cruelest one. When the context contains tokens that are semantically similar to what you want — but are actually irrelevant — the model cannot easily ignore them. Tool output, grep results, file listings, import traces: they introduce noise that looks like signal. The model is trained to predict the next token from everything in front of it. Semantically similar-but-irrelevant content actively pulls it in the wrong direction.
These three mechanisms do not add up linearly. They amplify each other:
| Mechanism | Standalone effect | With the others |
|---|---|---|
| Lost in the middle | Instructions fade | Fading instructions are easier to override by distractors |
| Attention dilution | Everything gets less focus | Amplifies lost-in-the-middle; spreads attention across more noise |
| Distractor interference | Wrong tokens pull on the model | More distractors, thinner attention — compound failure |
A concrete example: the widget.tsx story
A developer named Kong Tran was 40 minutes into a refactoring session at 2 a.m. He had told the agent at the start: “Do not edit widget.tsx.”
Forty minutes in, the agent edited widget.tsx.
Why? During those 40 minutes, the agent ran grep searches, read files, and traced imports. The string widget.tsx appeared in tool output dozens of times: as a search result, as an import path, as a filename in a directory listing. Nothing in the attention mechanism gives a user instruction hard priority over grep output; both are just tokens in the context. The volume of irrelevant references to widget.tsx diluted the original constraint until the model’s prediction machinery treated it like just another piece of context.
The model did not forget. It did exactly what it was trained to do: predict the next token from everything in front of it. The problem is the sheer volume of irrelevant content surrounding the important content.
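A rough way to see the imbalance in the widget.tsx story is to count where the filename appears in the context. The transcript below is invented for illustration (the roles, paths, and tool output are all hypothetical), and a raw mention count is only a crude proxy for what the model attends to.

```python
from collections import Counter

# Hypothetical transcript: (role, text) pairs standing in for a session's context.
context = [
    ("user", "Refactor the dashboard. Do not edit widget.tsx."),
    ("tool", "src/widget.tsx:12: import { Panel } from './panel'"),
    ("tool", "src/dashboard.tsx:3: import { Widget } from './widget.tsx'"),
    ("tool", "ls src: dashboard.tsx panel.tsx widget.tsx"),
    ("tool", "grep -rn Widget: src/widget.tsx:40, src/widget.tsx:88"),
]


def mentions_by_role(context: list[tuple[str, str]], needle: str) -> Counter:
    """Count occurrences of `needle` in the context, grouped by message role."""
    counts: Counter = Counter()
    for role, text in context:
        counts[role] += text.count(needle)
    return counts


counts = mentions_by_role(context, "widget.tsx")
print(f"instruction mentions: {counts['user']}, tool-output mentions: {counts['tool']}")
```

In this tiny example the single instruction is already outnumbered five to one after four tool calls. In a 40-minute session the ratio is far more lopsided.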
The numbers: where the cliff is
Cognition (the team behind Devin) measured agent task success across sessions of varying length in 2026. Their findings:
- By the 35-minute mark, every agent’s success rate is declining.
- Doubling task duration quadruples the failure rate — not doubles, quadruples.
The relationship is not linear. The longer the task, the disproportionately worse the odds. This is the compounding nature of the three mechanisms showing up empirically.
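If doubling the duration quadruples the failure rate, the failure rate grows roughly with the square of the duration. The sketch below extrapolates from that single statement. The power-law form, the function name, and the 5%-at-10-minutes baseline are all assumptions made for illustration, not figures from the Cognition data.

```python
def scaled_failure_rate(base_rate: float, base_minutes: float, minutes: float) -> float:
    """Extrapolate a failure rate assuming it grows with the square of duration.

    Capped at 1.0 because a rate cannot exceed 100%.
    """
    return min(1.0, base_rate * (minutes / base_minutes) ** 2)


# Hypothetical baseline: 5% of tasks fail at 10 minutes.
for minutes in (10, 20, 40, 80):
    rate = scaled_failure_rate(base_rate=0.05, base_minutes=10, minutes=minutes)
    print(f"{minutes:>3} min -> {rate:.0%} failure rate")
```

Under these assumed numbers, each doubling multiplies the failure rate by four until the cap is reached, which is why a session that feels only slightly too long can already be well down the curve.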
Common misconceptions
- “Context rot only matters when the window is nearly full.” False. Chroma’s research showed degradation beginning at ~30% fill across 18 models. The threshold is about context size, not remaining capacity.
- “Larger context windows solve the problem.” Larger windows delay the hard cutoff but do not address attention dilution or distractor interference. A 1M-token window still rots; it just takes longer to become obvious.
- “The model is forgetting my instructions.” It is not forgetting in any human sense; the tokens are still there. The model is being misled by semantically similar noise that outweighs the signal from a single instruction given 30 minutes ago.
- “If I just repeat my key instructions more often, I am safe.” Repetition helps but does not eliminate the problem. Distractor interference and attention dilution still accumulate with every tool call’s output that lands in the context.
Frequently asked questions
At what context size should I start worrying about context rot? The Chroma research found measurable degradation beginning at around 30% of window fill for frontier models. In practice, a typical agentic coding session starts hitting trouble somewhere between 20 and 40 minutes in, which is consistent with the 35-minute Cognition finding. Plan around session length rather than token count.
Does context rot affect all models equally? The Chroma research covered 18 models and found the phenomenon across all of them. The severity and exact threshold vary by model, but no current frontier model is immune. The architecture (transformers with quadratic attention) is the source, and that is shared across virtually all current LLMs.
What can I actually do about it? Three strategies are covered in upcoming episodes: prompt caching (restructure what goes in the context and when), model teardown / context reset (start a fresh session with a distilled summary rather than an endless append), and context engineering (deliberate management of what enters the context in the first place). The key shift is treating context as a resource to be managed, not a log to be accumulated.
Is this different from the “needle in a haystack” problem? Related but distinct. Needle-in-a-haystack tests measure whether a model can retrieve specific information from a long context. Context rot is about behavioral degradation in agentic tasks — wrong edits, ignored constraints, compounding errors. The lost-in-the-middle mechanism is shared; context rot is the broader operational consequence.
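The context-reset strategy mentioned above (start a fresh session with a distilled summary rather than an endless append) can be sketched as follows. This is a minimal illustration under stated assumptions: `Session`, `rough_token_count`, and `reset_if_needed` are hypothetical names, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and `summarize` is a caller-supplied function that in practice would probably be a model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Session:
    constraints: list[str]  # standing instructions, always carried into a fresh session
    messages: list[str] = field(default_factory=list)  # accumulated turns and tool output


def rough_token_count(text: str) -> int:
    """Crude estimate of about 4 characters per token (assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)


def context_size(session: Session) -> int:
    """Estimated total tokens across constraints and messages."""
    return sum(rough_token_count(m) for m in session.constraints + session.messages)


def reset_if_needed(
    session: Session,
    budget_tokens: int,
    summarize: Callable[[list[str]], str],
) -> Session:
    """Return the same session if under budget, else a fresh one seeded with a summary."""
    if context_size(session) <= budget_tokens:
        return session
    summary = summarize(session.messages)
    # Constraints are copied verbatim so they sit at the start of the new context.
    return Session(constraints=list(session.constraints), messages=[summary])


# Usage with a placeholder summarizer.
long_session = Session(
    constraints=["Do not edit widget.tsx."],
    messages=["x" * 4000] * 10,  # stands in for accumulated tool output
)
fresh = reset_if_needed(long_session, 1_000, lambda msgs: f"Summary of {len(msgs)} earlier messages.")
print(context_size(long_session), "->", context_size(fresh))
```

The design choice worth noting is that constraints are kept separate from the message log, so a reset discards noise without ever routing the instructions through a lossy summary.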
Where this fits in the series
This is episode 4 in the How Claude Actually Works course on “The Stack Underflow.” The previous episode introduced the lost-in-the-middle phenomenon; this one shows how that mechanism combines with attention dilution and distractor interference to produce the compound failure mode that kills long agentic sessions. The next episode covers prompt caching — the first practical tool for pushing back against rot. Browse all tutorials to see the full course sequence.