
Threads, context, and caching

This page explains how agent threads are transformed into model input and why that affects reliability, latency, and cost. It covers context windows, prompt caching, and compaction so you can understand what changes as conversations get longer.

Models are still, in practical terms, glorified autocomplete. Coding agents turn agent threads into model inputs (i.e., input strings), and most existing coding agents follow roughly the same simple framework.
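To make the idea of a thread as an input string concrete, here is a minimal sketch. The `Message` type, the role names, and the tag-style delimiters are assumptions made for illustration; real agents and model APIs each use their own message format.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # hypothetical role names: "system", "user", "assistant", "tool"
    content: str


def render_thread(messages: list[Message]) -> str:
    """Flatten a thread into one input string, oldest message first."""
    # The <role>...</role> delimiters are illustrative, not a real wire format.
    return "".join(f"<{m.role}>\n{m.content}\n</{m.role}>\n" for m in messages)


thread = [
    Message("system", "You are a coding agent."),
    Message("user", "Rename foo() to bar()."),
    Message("assistant", "Searching for usages of foo."),
    Message("tool", "src/app.py:12: foo()"),
]

prompt = render_thread(thread)
print(prompt)
```

Because each new message is appended at the end, the string for a longer thread starts with the string for the shorter one. That property matters for prompt caching, discussed further down the page.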

So what is the context window? It is the maximum length of the input, measured in tokens, that the model can ingest. Coding agents often set their window slightly below the actual model limit to reserve space for trailing system messages or a compaction prompt.
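The reserved space can be expressed as simple arithmetic. The sketch below is illustrative only: the function names and the numbers (a 200,000-token limit with 8,000 tokens reserved) are hypothetical, not values taken from any particular agent or model.

```python
def effective_window(model_limit: int, reserved: int) -> int:
    """Tokens available to the thread after reserving headroom."""
    if reserved < 0 or reserved >= model_limit:
        raise ValueError("reserved must be in [0, model_limit)")
    return model_limit - reserved


def fits(thread_tokens: int, model_limit: int, reserved: int) -> bool:
    """True if the thread still fits inside the reduced window."""
    return thread_tokens <= effective_window(model_limit, reserved)


# Hypothetical numbers, for illustration only.
MODEL_LIMIT = 200_000
RESERVED = 8_000

window = effective_window(MODEL_LIMIT, RESERVED)
print(window)
print(fits(150_000, MODEL_LIMIT, RESERVED))
print(fits(195_000, MODEL_LIMIT, RESERVED))
```

With these numbers, a 195,000-token thread is under the raw model limit but over the agent's reduced window, so the agent would have to act (for example, by compacting) before sending it.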

Now, especially if you are going to work with Anthropic models, read this article. The author gives a useful overview of context peculiarities and how models behave as context usage grows. TL;DR: Keep your threads short.

TL;DR: Prompt caching is an optimization for LLM APIs that lets the model reuse previously computed internal context, especially prompt prefixes such as system instructions or examples. That way, repeated similar requests do not have to be processed from scratch.
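The prefix behaviour can be illustrated with a toy model. This is a sketch under simplifying assumptions: it treats a request as a list of opaque segments and assumes that only the longest shared leading run can be reused. Real providers cache at their own granularity and apply their own rules (such as minimum lengths and expiry), none of which are modelled here. The function name and segment labels are hypothetical.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count the leading segments that are identical in both requests."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


first = ["system", "examples", "user:1"]

# Appending to the thread leaves the whole earlier request as a prefix.
appended = ["system", "examples", "user:1", "assistant:1", "user:2"]

# Editing the very first segment invalidates everything after it.
edited = ["system-v2", "examples", "user:1", "assistant:1", "user:2"]

reused_appended = shared_prefix_len(first, appended)
reused_edited = shared_prefix_len(first, edited)

print(reused_appended, len(appended) - reused_appended)
print(reused_edited, len(edited) - reused_edited)
```

In this toy model, appending reuses all three earlier segments and only the two new ones need processing, while changing the first segment leaves nothing to reuse. This is why stable content such as system instructions is usually placed at the start of the input.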

TL;DR: Compaction extends the effective context length for long-running conversations and tasks by automatically summarizing older context when a thread approaches the context window limit. It keeps the active context focused and performant by replacing stale content with concise summaries.
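A minimal sketch of the compaction step, assuming a thread is a list of message strings. The `compact` function, the `keep_recent` parameter, and the stubbed `count_tokens` and `summarize` helpers are all hypothetical; in a real agent the summary would come from a model call and token counts from the model's tokenizer.

```python
from typing import Callable


def compact(
    messages: list[str],
    limit: int,
    keep_recent: int,
    count_tokens: Callable[[str], int],
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace older messages with one summary once the thread exceeds limit."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(m) for m in messages)
    if total <= limit or len(messages) <= keep_recent:
        return messages  # still fits, or nothing old enough to summarize
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent


# Stand-in helpers: whitespace word count and a placeholder summary.
def count_words(text: str) -> int:
    return len(text.split())


def fake_summary(old: list[str]) -> str:
    return f"[summary of {len(old)} earlier messages]"


thread = ["a b c d", "e f g h", "i j k l", "m n"]

compacted = compact(thread, limit=10, keep_recent=2,
                    count_tokens=count_words, summarize=fake_summary)
untouched = compact(thread, limit=20, keep_recent=2,
                    count_tokens=count_words, summarize=fake_summary)
print(compacted)
print(untouched)
```

Note the trade-off this design implies: the summary replaces the start of the input, so the previously cached prefix no longer matches after a compaction.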

Attribution

Authors