Anthropic shipped Claude Sonnet 4.5 in September 2025, with major improvements to prompt caching and tool use. For one of our clients, who runs an internal-support AI assistant handling 4,000 conversations a day, the numbers speak for themselves.
What prompt caching is
The Anthropic API lets you mark stable parts of the prompt (system prompt, knowledge base) as cacheable. On subsequent requests, those parts are billed as cache reads at roughly 10% of the normal input-token price (a ~90% discount) and latency drops, because the model skips reprocessing them. Two caveats: a cached prefix only survives 5 minutes without being reused (the TTL refreshes on every hit), and writing to the cache costs 25% more than regular input tokens.
Typical setup
For our client:
- System prompt (15k tokens): cached.
- Company knowledge base (40k tokens): cached.
- User conversation: not cached (changes every turn).
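A minimal sketch of that layout with the Anthropic Python SDK. The `cache_control` markers are the documented way to flag a cacheable prefix; the model alias and the placeholder strings are assumptions, not our client's actual prompts.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "..."   # stable ~15k-token system prompt
KNOWLEDGE_BASE = "..."  # stable ~40k-token company knowledge base

def ask(conversation: list[dict]):
    """One turn: the two stable blocks are marked cacheable, the messages are not."""
    return client.messages.create(
        model="claude-sonnet-4-5",  # alias; pin a dated snapshot in production
        max_tokens=1024,
        system=[
            # Everything up to a cache_control marker becomes a cacheable prefix.
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": KNOWLEDGE_BASE,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=conversation,  # the per-turn, uncached part
    )
```

The response's `usage` object reports `cache_creation_input_tokens` and `cache_read_input_tokens`, which is the quickest way to confirm the cache is actually being hit.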
Costs before and after
Without cache: 55k input tokens per turn × 8 average turns × 4,000 conversations = 1.76 billion input tokens per day. At Claude Sonnet 4.5 pricing ($3 per million input tokens), that's ~$5,300/day.
With cache: the same 55k tokens are billed at the 10% cache-read rate, plus ~800 genuinely new tokens per turn at full price, which works out to ~$620/day. Savings: ~88%.
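Here is that arithmetic as a runnable sketch. The rates are Sonnet 4.5's published prices ($3 per million input tokens, cache reads at 10% of that); the assumption that the shared prefix stays warm all day, so cache writes are negligible, is what makes the with-cache figure so low.

```python
INPUT_PER_MTOK = 3.00        # $ per million regular input tokens
CACHE_READ_PER_MTOK = 0.30   # cache reads: 10% of the input rate
# (cache writes cost $3.75/MTok but are rare at this volume, so ignored here)

CONVS_PER_DAY = 4_000
TURNS_PER_CONV = 8
STABLE_TOKENS = 55_000       # system prompt + knowledge base
NEW_TOKENS_PER_TURN = 800    # fresh conversation tokens, on average

turns = CONVS_PER_DAY * TURNS_PER_CONV  # 32,000 turns/day

# Without caching: the full stable prefix is re-billed at full price every turn.
no_cache = turns * (STABLE_TOKENS + NEW_TOKENS_PER_TURN) * INPUT_PER_MTOK / 1e6

# With caching: the prefix is shared by every conversation, so at this volume
# it never sits idle for 5 minutes and is almost always a cheap cache read.
reads = turns * STABLE_TOKENS * CACHE_READ_PER_MTOK / 1e6
fresh = turns * NEW_TOKENS_PER_TURN * INPUT_PER_MTOK / 1e6
with_cache = reads + fresh

print(f"no cache:   ${no_cache:,.0f}/day")             # ≈ $5,357/day
print(f"with cache: ${with_cache:,.0f}/day")           # ≈ $605/day
print(f"savings:    {1 - with_cache / no_cache:.0%}")  # ≈ 89%
```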
Latency
Time-to-first-token with cache: ~280 ms. Without cache: ~1.4 s. For a real-time chat UX, staying under 500 ms is the difference between "smooth" and "slow".
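Both effects are easy to verify yourself: stream a response, time the first text chunk, and read the cache counters the API returns in `usage`. A sketch (the question is a placeholder; `SYSTEM_PROMPT` is the stable prefix from the setup above):

```python
import time
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = "..."  # the stable, cacheable prefix

start = time.perf_counter()
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=512,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "How do I reset my VPN token?"}],
) as stream:
    for _ in stream.text_stream:
        ttft = time.perf_counter() - start  # time to first streamed text
        break
    final = stream.get_final_message()  # drains the rest of the stream

print(f"TTFT: {ttft * 1000:.0f} ms")
# usage shows how much of the prompt was served from (or written to) cache:
print(final.usage.cache_read_input_tokens, "tokens read from cache")
print(final.usage.cache_creation_input_tokens, "tokens written to cache")
```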
When NOT to cache
- Prompts that change on every request (e.g. a per-user dynamic system prompt): every call is a cache miss, so you pay the 25% write premium without ever collecting the read discount.
- Volumes below ~200 requests/day: requests arrive further apart than the 5-minute TTL, the cache keeps expiring, and the setup doesn't pay for itself (see the sketch after this list).
- A tiny stable prefix (under ~1k tokens): that's below Anthropic's minimum cacheable prompt length (1,024 tokens on Sonnet models), so the cache marker is simply ignored.
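A crude sanity check for the second and third bullets. The 5-minute TTL and the 1,024-token minimum cacheable prefix on Sonnet models are Anthropic's published parameters; the evenly-spread traffic model is our simplification.

```python
def caching_pays_off(requests_per_day: float, stable_tokens: int,
                     ttl_minutes: float = 5.0) -> bool:
    """Return True if a shared cached prefix is likely to stay warm."""
    if stable_tokens < 1024:
        # Below the minimum cacheable length, the cache marker is ignored.
        return False
    # Average gap between requests, assuming traffic spread over 24 hours.
    avg_gap_minutes = 24 * 60 / requests_per_day
    # If the gap exceeds the TTL, most requests re-pay the 25% write
    # premium instead of collecting the 90% read discount.
    return avg_gap_minutes < ttl_minutes

print(caching_pays_off(150, 55_000))    # False: the cache keeps expiring
print(caching_pays_off(4_000, 55_000))  # True: the prefix stays warm all day
```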
What we changed in our prompts
We now structure prompts to maximise cache hits: the stable part at the top (cached), the variable part at the bottom. It's a small refactor, but on high-volume systems it saves thousands of euros a year.
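A hypothetical before/after to make the refactor concrete (the variable names and message format are ours): anything volatile, like the current date or the user's name, moves out of the cached prefix, because a single changed byte invalidates the entire prefix match.

```python
from datetime import date

today = date.today().isoformat()
user_name = "j.doe"                            # hypothetical user
user_question = "How do I reset my VPN token?"

# BEFORE: volatile values were interpolated into the system prompt. A per-user
# value in the prefix means the big shared prefix is never reused across users,
# so every conversation starts with an expensive cache write.
system_before = (f"You are the internal support assistant for {user_name}. "
                 f"Today is {today}. ...")

# AFTER: the system prompt is a byte-stable constant (cacheable), and the
# volatile context rides along in the uncached user turn instead.
SYSTEM_STABLE = "You are the internal support assistant. ..."
user_turn = f"[context: today={today}, user={user_name}]\n{user_question}"
```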