Threads, context, and caching
This page explains how agent threads are transformed into model input and why that affects reliability, latency, and cost. It covers context windows, prompt caching, and compaction so you can understand what changes as conversations get longer.
From thread to model input
Section titled “From thread to model input”Models are still, in practical terms, glorified autocomplete. Coding agents structure agent threads as model inputs (i.e., input strings). Most existing coding agents follow roughly the same framework. It is simple:
So what is the context window? It is the maximum length of the input string that the model can ingest. Coding agents often reduce this window slightly from the actual model limits to reserve space for tail system messages or a compaction prompt.
Now, especially if you are going to work with Anthropic models, read this article. The author gives a useful overview of context peculiarities and how models behave as context usage grows. TL;DR: Keep your threads short:
- 200k Tokens Is Plenty Amp. 2025-12-09
Prompt caching
Section titled “Prompt caching”TL;DR: Prompt caching is an optimization for LLM APIs that lets the model reuse previously computed internal context, especially prompt prefixes such as system instructions or examples. That way, repeated similar requests do not have to be processed from scratch.
- Lessons from Building Claude Code: Prompt Caching Is Everything Thariq Shihipar. 2026-02-19
- Prompt caching OpenAI.
Compaction
Section titled “Compaction”TL;DR: Compaction extends the effective context length for long-running conversations and tasks by automatically summarizing older context when a thread approaches the context window limit. It keeps the active context focused and performant by replacing stale content with concise summaries.
- Compaction Anthropic.
- Compaction OpenAI.
- Pi Compaction Prompt Mario Zechner.
