Some words about models
Choose models by reliability, workflow fit, and pricing model, then benchmark them on your own real tasks.
Where to access free models
- ChatGPT Codex: available on the free plan until April 2nd, 2026.
- Amp: $10/day of free usage if you signed up before February 10th, 2026. Otherwise, pricey.
- OpenCode Zen: they always have some Chinese or xAI model available for free, though conversations are sent to vendors as training data. All models are hosted in the USA.
Frontier model comparison
Claude Opus
- Decent creativity; can present with an engaging personality.
- Tends not to read enough context, so it can hallucinate.
- Heavily RL-trained to follow instructions and perform tool calls.
- Very expensive per token. Use short threads: cost grows quickly with thread length, and the cache gets pruned over time.
GPT
- Workhorse. RL-trained to gather a lot of context before answering, so it is slower but tends to generate the most “in-place” code.
- Cheapest cost per token.
- Codex variants have great compaction algorithms and can reliably work over long threads.
Gemini
- Looks very creative and brilliant, but the models can be unreliable (they tend to ignore instructions).
Chinese models
- MiniMax, Kimi, Qwen, etc. Very cheap. MiniMax M2.5 is very close to the frontier.
- Smaller versions are often open-source.
- Be careful to use US/EU inference providers only; OpenCode Zen hosts all its models in the USA. Avoid routing sensitive work through providers in other jurisdictions.
Model pricing
There are two very different pricing worlds in AI tools: API pricing (pay per token) and app subscriptions (pay a flat monthly fee with usage limits).
API pricing
On price trackers like models.dev and llm-prices.com, you’ll usually see these fields:
- Input cost: what you pay for non-cached input tokens sent to the model.
- Output cost: what you pay for tokens generated by the model.
- Cache write cost: what you pay when the provider stores a prompt prefix in cache (so it can be reused later).
- Cache read cost: what you pay when later requests reuse that cached prefix.
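As a concrete sketch, here is how those four fields combine into a per-request cost. The usage keys follow Anthropic’s Messages API; other providers report token counts under different names, and the prices below are made up for illustration, so check your provider’s real rates:

```python
# Hypothetical per-million-token prices (USD) — not any provider's real rates.
PRICES = {
    "input": 3.00,        # non-cached input tokens
    "output": 15.00,      # tokens generated by the model
    "cache_write": 3.75,  # storing a prompt prefix in cache
    "cache_read": 0.30,   # reusing a cached prefix
}

def request_cost(usage: dict) -> float:
    """Cost of one request, from a usage breakdown.
    Keys follow Anthropic's Messages API usage object."""
    per_token = {k: v / 1_000_000 for k, v in PRICES.items()}
    return (
        usage.get("input_tokens", 0) * per_token["input"]
        + usage.get("output_tokens", 0) * per_token["output"]
        + usage.get("cache_creation_input_tokens", 0) * per_token["cache_write"]
        + usage.get("cache_read_input_tokens", 0) * per_token["cache_read"]
    )

usage = {
    "input_tokens": 2_000,
    "output_tokens": 800,
    "cache_creation_input_tokens": 10_000,
    "cache_read_input_tokens": 50_000,
}
print(f"${request_cost(usage):.4f}")  # → $0.0705
```

Note how cheap cache reads are relative to fresh input: a long cached prefix reused across many requests costs a fraction of sending those tokens uncached every time.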
Simple mental model:

total cost = input + output + cache write + cache read

If you’re integrating directly with an LLM API, lowering cost per request/session mostly means reducing the most expensive token categories:
- Keep prompts stable at the top (system prompt, tool defs, long instructions) to maximize cache hits.
- Move dynamic parts (timestamps, random IDs, volatile context) lower in the prompt so they don’t invalidate the cached prefix.
- Cap output length when possible (`max_tokens` or equivalent).
- Keep threads compact. Good cache hit rates help, but each turn still adds some uncached tail tokens, and cache entries can expire or be pruned over long sessions.
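The prompt-layout advice above can be sketched as follows. The function and prompt contents are hypothetical; the pattern that matters is stable prefix first, volatile context last, because provider-side prefix caching only matches identical leading bytes:

```python
import datetime

# Stable parts: identical on every request, so the provider can cache
# this prefix once and serve later requests from the cache.
SYSTEM_PROMPT = "You are a code-review assistant. Follow the style guide below."
TOOL_DEFS = "...long, unchanging tool definitions..."

def build_messages(user_question: str) -> list[dict]:
    """Keep dynamic content (timestamps, volatile context) at the end,
    so it never invalidates the cached stable prefix above it."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return [
        # Cached prefix: byte-identical across requests.
        {"role": "system", "content": SYSTEM_PROMPT + "\n" + TOOL_DEFS},
        # Dynamic tail: changes every request, but sits *after* the prefix.
        {"role": "user", "content": f"[time: {now}]\n{user_question}"},
    ]

msgs = build_messages("Review this diff for concurrency bugs.")
```

If the timestamp were injected into the system prompt instead, every request would produce a different prefix and the cache would never hit.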
If you’re using an agent harness, many of these optimizations are handled internally (prompt layout, caching, compaction). Your main cost levers are usually model choice and keeping tasks/threads scoped.
Subscriptions
Subscriptions are different from API billing: you pay a monthly fee for usage inside a product, usually with fair-use limits or soft/hard caps. These plans do not include raw API credits for your own apps.
For most people, this is the cheapest way to get heavy day-to-day usage. How favorable a subscription is relative to equivalent API usage can swing a lot as vendors change limits, model mixes, and pricing.
Common subscription options: