Threads, context, and caching
This page explains how agent threads are transformed into model input and why that directly affects reliability, latency, and cost. It covers context windows, prompt caching, and compaction so you can understand what changes as conversations get longer.
From thread to model input
Models are still just glorified autocomplete. What agent harnesses do is structure agent threads as model inputs (i.e., input strings). Almost every existing harness follows roughly the same dead-simple framework: serialize the system prompt, tool definitions, and message history, in order, into a single input string.
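A minimal sketch of that framework, with made-up role tags and message shapes (no particular harness's API): the harness walks the thread and renders each event as one tagged segment.

```python
# Hypothetical sketch of thread-to-input serialization. Role tags and the
# message format are illustrative, not a real harness's wire format.

def render_thread(system_prompt: str, messages: list[dict]) -> str:
    """Flatten a thread into the input string the model actually sees."""
    parts = [f"[system]\n{system_prompt}"]
    for msg in messages:
        # Each event (user turn, assistant turn, tool call/result) becomes
        # one tagged segment, appended in thread order.
        parts.append(f"[{msg['role']}]\n{msg['content']}")
    return "\n\n".join(parts)

thread = [
    {"role": "user", "content": "List files in /tmp"},
    {"role": "assistant", "content": "call: ls /tmp"},
    {"role": "tool", "content": "a.txt b.txt"},
]
prompt = render_thread("You are a coding agent.", thread)
```

The important property is that everything the agent "remembers" is just text in this string; nothing else reaches the model.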
You might ask: what is the context window, then? It is just the maximum length of the input string that the model can ingest. Agent harnesses often reduce this window slightly from actual model limits to reserve space for some tail system messages or a compaction prompt.
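Concretely (the numbers here are illustrative, not any specific model's limits): the usable window is the raw model limit minus whatever the harness reserves.

```python
# Illustrative numbers: a harness keeps the usable window slightly below the
# model's raw limit to leave room for tail system messages or a compaction
# prompt.
MODEL_CONTEXT_LIMIT = 200_000   # max tokens the model can ingest (example)
RESERVED_TOKENS = 4_000         # tail messages, compaction prompt, etc.

def effective_window(model_limit: int, reserved: int) -> int:
    """Tokens the thread itself is actually allowed to occupy."""
    return model_limit - reserved

usable = effective_window(MODEL_CONTEXT_LIMIT, RESERVED_TOKENS)  # 196_000
```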
Now (especially if you’re going to work with Anthropic models), go read this article. The author gives a great overview of context peculiarities and of how model behavior changes as context usage grows. TL;DR: keep your threads short:
- 200k Tokens Is Plenty. Amp, 2025-12-09.
Prompt caching
TL;DR: Prompt caching is an LLM API optimization that lets the model reuse previously computed internal state for a shared prompt prefix (system instructions, examples, earlier turns), so repeated similar requests don’t have to be processed from scratch.
- Lessons from Building Claude Code: Prompt Caching Is Everything. Thariq Shihipar, 2026-02-19.
- Prompt caching. OpenAI.
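The idea can be sketched without any provider API. This toy cache stands in for the KV-cache reuse real inference servers do: work for a previously seen prefix is paid once, and only the suffix is processed on later requests. All names here are illustrative.

```python
import hashlib

# Toy prefix cache standing in for server-side KV-cache reuse: an identical
# prompt prefix (system instructions, tool definitions, examples) is
# "processed" once, then reused across requests.
_cache: dict[str, str] = {}

def process(text: str) -> str:
    """Stand-in for the expensive prefill computation."""
    return hashlib.sha256(text.encode()).hexdigest()

def run(prefix: str, suffix: str) -> tuple[str, bool]:
    """Handle a request, reusing cached work when the prefix was seen before."""
    key = hashlib.sha256(prefix.encode()).hexdigest()
    hit = key in _cache
    if not hit:
        _cache[key] = process(prefix)     # pay the full prefix cost once
    state = _cache[key]
    return process(state + suffix), hit   # only the suffix is new work

system = "You are a coding agent. <tool definitions>"
out1, hit1 = run(system, "user: list files")   # cold: hit1 is False
out2, hit2 = run(system, "user: read a.txt")   # warm: hit2 is True
```

This is why harnesses keep the prompt prefix byte-stable (system prompt first, no timestamps near the top): any change to the prefix invalidates the cached work behind it.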
Compaction
TL;DR: Compaction extends the effective context length for long-running conversations and tasks by automatically summarizing older context when the thread approaches the context-window limit. It keeps the active context focused and performant by replacing stale content with concise summaries.
- Compaction. Anthropic.
- Compaction. OpenAI.
- Pi Compaction Prompt. Mario Zechner.
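A hedged sketch of the mechanism, with stubs standing in for a real tokenizer and a model-generated summary (thresholds and names are illustrative): when the thread nears the limit, older messages collapse into a summary while recent turns are kept verbatim.

```python
# Illustrative compaction sketch. count_tokens and summarize are stubs:
# a real harness uses an actual tokenizer and asks the model for the summary.

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def summarize(messages: list[str]) -> str:
    # Stub: a real harness would prompt the model to summarize these turns.
    return f"[summary of {len(messages)} earlier messages]"

def compact(messages: list[str], limit: int, keep_recent: int = 2) -> list[str]:
    """Replace older messages with one summary once the thread nears the limit."""
    total = sum(count_tokens(m) for m in messages)
    if total <= limit or len(messages) <= keep_recent:
        return messages  # still fits, or nothing old enough to fold away
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent

thread = [
    "user: long exploration " * 50,
    "tool: big output " * 50,
    "user: the actual question",
    "assistant: working on it",
]
compacted = compact(thread, limit=100)
# The two long early turns fold into one summary; the last two survive verbatim.
```

The trade-off is the same one the articles above discuss: summaries are lossy, so compaction buys length at the cost of detail the model can no longer see.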