Anthropic shipped Claude Sonnet 4.5 in September 2025, with major improvements to prompt caching and tool use. For one of our clients, who runs an internal-support AI assistant handling 4,000 conversations a day, the numbers speak for themselves.
What prompt caching is
The Anthropic API lets you mark stable parts of the prompt (system prompt, knowledge base) as cacheable. On subsequent requests, those parts are billed as cache reads at roughly 10% of the normal input-token price (a ~90% discount) and latency drops, because the model skips reprocessing them. Two caveats: a cached prefix only survives 5 minutes without being reused (the TTL refreshes on every hit), and writing to the cache costs 25% more than regular input tokens.
Typical setup
For our client:
- System prompt (15k tokens): cached.
- Company knowledge base (40k tokens): cached.
- User conversation: not cached (changes every turn).
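A minimal sketch of that layout with the Anthropic Python SDK. The `cache_control` markers are the documented way to flag a cacheable prefix; the model alias and the placeholder strings are assumptions, not our client's actual prompts.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "..."   # stable ~15k-token system prompt
KNOWLEDGE_BASE = "..."  # stable ~40k-token company knowledge base

def ask(conversation: list[dict]):
    """One turn: the two stable blocks are marked cacheable, the messages are not."""
    return client.messages.create(
        model="claude-sonnet-4-5",  # alias; pin a dated snapshot in production
        max_tokens=1024,
        system=[
            # Everything up to a cache_control marker becomes a cacheable prefix.
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": KNOWLEDGE_BASE,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=conversation,  # the per-turn, uncached part
    )
```

The response's `usage` object reports `cache_creation_input_tokens` and `cache_read_input_tokens`, which is the quickest way to confirm the cache is actually being hit.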
Costs before and after
Without cache: 55k input tokens per turn × 8 average turns × 4,000 conversations = 1.76 billion input tokens per day. At Claude Sonnet 4.5 pricing ($3 per million input tokens), that's ~$5,300/day.
With cache: the same 55k tokens are billed at the 10% cache-read rate, plus ~800 genuinely new tokens per turn at full price, which works out to ~$620/day. Savings: ~88%.
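Here is that arithmetic as a runnable sketch. The rates are Sonnet 4.5's published prices ($3 per million input tokens, cache reads at 10% of that); the assumption that the shared prefix stays warm all day, so cache writes are negligible, is what makes the with-cache figure so low.

```python
INPUT_PER_MTOK = 3.00        # $ per million regular input tokens
CACHE_READ_PER_MTOK = 0.30   # cache reads: 10% of the input rate
# (cache writes cost $3.75/MTok but are rare at this volume, so ignored here)

CONVS_PER_DAY = 4_000
TURNS_PER_CONV = 8
STABLE_TOKENS = 55_000       # system prompt + knowledge base
NEW_TOKENS_PER_TURN = 800    # fresh conversation tokens, on average

turns = CONVS_PER_DAY * TURNS_PER_CONV  # 32,000 turns/day

# Without caching: the full stable prefix is re-billed at full price every turn.
no_cache = turns * (STABLE_TOKENS + NEW_TOKENS_PER_TURN) * INPUT_PER_MTOK / 1e6

# With caching: the prefix is shared by every conversation, so at this volume
# it never sits idle for 5 minutes and is almost always a cheap cache read.
reads = turns * STABLE_TOKENS * CACHE_READ_PER_MTOK / 1e6
fresh = turns * NEW_TOKENS_PER_TURN * INPUT_PER_MTOK / 1e6
with_cache = reads + fresh

print(f"no cache:   ${no_cache:,.0f}/day")             # ≈ $5,357/day
print(f"with cache: ${with_cache:,.0f}/day")           # ≈ $605/day
print(f"savings:    {1 - with_cache / no_cache:.0%}")  # ≈ 89%
```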
Latency
Time-to-first-token with cache: ~280 ms. Without cache: ~1.4 s. For a real-time chat UX, staying under 500 ms is the difference between "smooth" and "slow".
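Both effects are easy to verify yourself: stream a response, time the first text chunk, and read the cache counters the API returns in `usage`. A sketch (the question is a placeholder; `SYSTEM_PROMPT` is the stable prefix from the setup above):

```python
import time
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = "..."  # the stable, cacheable prefix

start = time.perf_counter()
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=512,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "How do I reset my VPN token?"}],
) as stream:
    for _ in stream.text_stream:
        ttft = time.perf_counter() - start  # time to first streamed text
        break
    final = stream.get_final_message()  # drains the rest of the stream

print(f"TTFT: {ttft * 1000:.0f} ms")
# usage shows how much of the prompt was served from (or written to) cache:
print(final.usage.cache_read_input_tokens, "tokens read from cache")
print(final.usage.cache_creation_input_tokens, "tokens written to cache")
```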
When NOT to cache
- Prompts that change on every request (e.g. a per-user dynamic system prompt): every call is a cache miss, so you pay the 25% write premium without ever collecting the read discount.
- Volumes below ~200 requests/day: requests arrive further apart than the 5-minute TTL, the cache keeps expiring, and the setup doesn't pay for itself (see the sketch after this list).
- A tiny stable prefix (under ~1k tokens): that's below Anthropic's minimum cacheable prompt length (1,024 tokens on Sonnet models), so the cache marker is simply ignored.
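A crude sanity check for the second and third bullets. The 5-minute TTL and the 1,024-token minimum cacheable prefix on Sonnet models are Anthropic's published parameters; the evenly-spread traffic model is our simplification.

```python
def caching_pays_off(requests_per_day: float, stable_tokens: int,
                     ttl_minutes: float = 5.0) -> bool:
    """Return True if a shared cached prefix is likely to stay warm."""
    if stable_tokens < 1024:
        # Below the minimum cacheable length, the cache marker is ignored.
        return False
    # Average gap between requests, assuming traffic spread over 24 hours.
    avg_gap_minutes = 24 * 60 / requests_per_day
    # If the gap exceeds the TTL, most requests re-pay the 25% write
    # premium instead of collecting the 90% read discount.
    return avg_gap_minutes < ttl_minutes

print(caching_pays_off(150, 55_000))    # False: the cache keeps expiring
print(caching_pays_off(4_000, 55_000))  # True: the prefix stays warm all day
```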
What we changed in our prompts
We now structure prompts to maximise cache hits: the stable part at the top (cached), the variable part at the bottom. It's a small refactor, but on high-volume systems it saves thousands of euros a year.
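A hypothetical before/after to make the refactor concrete (the variable names and message format are ours): anything volatile, like the current date or the user's name, moves out of the cached prefix, because a single changed byte invalidates the entire prefix match.

```python
from datetime import date

today = date.today().isoformat()
user_name = "j.doe"                            # hypothetical user
user_question = "How do I reset my VPN token?"

# BEFORE: volatile values were interpolated into the system prompt. A per-user
# value in the prefix means the big shared prefix is never reused across users,
# so every conversation starts with an expensive cache write.
system_before = (f"You are the internal support assistant for {user_name}. "
                 f"Today is {today}. ...")

# AFTER: the system prompt is a byte-stable constant (cacheable), and the
# volatile context rides along in the uncached user turn instead.
SYSTEM_STABLE = "You are the internal support assistant. ..."
user_turn = f"[context: today={today}, user={user_name}]\n{user_question}"
```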