Threads, context, and caching

This page explains how agent threads are transformed into model input and why that directly affects reliability, latency, and cost. It covers context windows, prompt caching, and compaction so you can understand what changes as conversations get longer.

Models are still just glorified autocomplete. What an agent harness does is structure the agent thread as a model input (i.e., a single input string). Almost every existing harness follows roughly the same framework, and it's dead simple; the sketch below shows the idea.
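Here is a minimal sketch of that framework in Python. The role names and serialization format are illustrative assumptions, not any particular harness's wire format:

```python
# Minimal sketch: flatten a thread of role/content messages into the one
# input string the model actually sees. Roles and formatting are assumptions.
def serialize_thread(messages: list[dict]) -> str:
    parts = [f"{m['role'].upper()}: {m['content']}" for m in messages]
    parts.append("ASSISTANT:")  # cue the model to produce the next turn
    return "\n\n".join(parts)

thread = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Rename foo() to bar() across the repo."},
    {"role": "tool", "content": "grep found 3 occurrences of foo()"},
]
prompt = serialize_thread(thread)  # this string is the entire model input
```

Everything the harness wants the model to know (prior turns, tool results, file contents) has to end up somewhere in that one string.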

You might ask: what is the context window, then? It is just the maximum length of the input string that the model can ingest. Agent harnesses often set their own window slightly below the actual model limit, reserving space for tail system messages or a compaction prompt.
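In numbers, the bookkeeping looks roughly like this (the limit and reserve below are made-up figures, not any vendor's actual values):

```python
# Illustrative figures only: a 200k-token model limit with a slice held back
# by the harness for tail system messages and the compaction prompt.
MODEL_CONTEXT_LIMIT = 200_000  # the model's hard input limit (assumed)
HARNESS_RESERVE = 4_000        # tokens the harness keeps in reserve (assumed)

EFFECTIVE_WINDOW = MODEL_CONTEXT_LIMIT - HARNESS_RESERVE

def remaining_budget(tokens_used: int) -> int:
    """How much the thread can still grow before the harness must compact."""
    return EFFECTIVE_WINDOW - tokens_used
```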

Now (especially if you're going to work with Anthropic models), go read this article. The author gives a great overview of context peculiarities and of how models behave as context usage grows. TL;DR: keep your threads short.

Prompt caching

TL;DR: Prompt caching is an optimization in LLM APIs that lets the provider reuse previously computed internal state for a repeated prompt prefix (system instructions, examples, and so on), so similar requests don't have to be processed from scratch.
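As a concrete example, here is what a cache breakpoint looks like with the Anthropic Messages API (the model name and prompt text are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_STABLE_PREFIX = "You are a coding agent. <long tool definitions here>"

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; any caching-capable model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_PREFIX,
            # Everything up to and including this block becomes cacheable:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Rename foo() to bar()."}],
)
```

Cache hits require an exact prefix match, which is why harnesses append new turns to the end of the thread rather than editing earlier ones: any change before the breakpoint invalidates the cached prefix.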

Compaction

TL;DR: Compaction extends the effective context length for long-running conversations and tasks by automatically summarizing older context when the thread approaches the context window limit. It keeps the active context focused and performant by replacing stale content with concise summaries.
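A minimal sketch of the mechanism, under loud assumptions: count_tokens and summarize are hypothetical stand-ins for the harness's tokenizer and for a model call that summarizes the dropped turns, and the threshold and keep-count are made up.

```python
# Hypothetical compaction policy: once usage crosses a threshold, replace the
# middle of the thread with a summary, keeping the system prompt and the most
# recent turns verbatim. All constants and helpers here are assumptions.
COMPACTION_THRESHOLD = 0.8  # compact once ~80% of the window is used
KEEP_RECENT = 4             # recent messages to keep verbatim

def maybe_compact(messages, window, count_tokens, summarize):
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < COMPACTION_THRESHOLD * window or len(messages) <= KEEP_RECENT + 1:
        return messages  # plenty of room, or nothing worth summarizing
    head = messages[:1]             # keep the system prompt verbatim
    tail = messages[-KEEP_RECENT:]  # keep the freshest turns verbatim
    summary = summarize(messages[1:-KEEP_RECENT])
    note = {"role": "system", "content": f"Summary of earlier turns: {summary}"}
    return head + [note] + tail
```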