Pranav Belhekar

Visual guide

The KV Cache, Illustrated

How modern LLMs actually remember the conversation while they generate. From the redundant computation that made naïve attention untenable, through what's exactly inside the cache, to the optimizations every production team uses on top: paged attention, GQA, prompt caching, attention sinks, speculative decoding. With worked math and thirteen diagrams.

25 min read

The first token of a 50,000-token request comes back in four seconds. Every token after that comes back in thirty milliseconds. The model is the same model. The GPUs are the same GPUs. The request is the same request. Strip out one engineering hack and that thirty-millisecond number becomes minutes, the API costs you’ve quietly accepted become a hundred times higher, and “long context” stops meaning anything useful.

The hack is the KV cache: a quiet mountain of intermediate state that lives in GPU memory while your request runs. It is, depending on how you count, the single most consequential data structure in modern LLM serving. Almost every cost curve, every latency oddity, every “prompt caching saves 90%” line item on a vendor’s pricing page is downstream of it. Once you see it, the whole inference stack (paged attention, GQA, prompt caching, attention sinks) stops being a list of acronyms and starts being the inevitable consequence of one problem and one fix.

The problem is wasteful redundancy. The fix is to cache the parts that don’t change. Everything past that is a refinement on the same idea.

What’s in this guide: the problem · the cache, exactly · anatomy · one step worked out · growth · memory math · prefill vs decode · GQA · paged attention · prompt caching · attention sinks · where it’s going · what it isnt · in your code · references

The problem: naive attention redoes its homework

To understand what the cache solves, look at what would happen without it. A transformer generating one token at a time, with no cache, has to recompute attention over the entire sequence so far on every step.

Recall how attention works inside a single layer. The input sequence (each token already an embedding vector) gets projected three ways to produce a Query matrix Q, a Key matrix K, and a Value matrix V. Attention is then softmax(Q · Kᵀ / √d) · V. A long way of saying: for each token’s query, take a weighted average over all tokens’ values, where the weights come from how well each token’s key matches that query. The output passes up the layer stack and eventually gets decoded into the next token.

When you generate the next token from a sequence of length N, the part that actually matters is the very last row of that big attention output: the row corresponding to “what should the next token be.” Everything else is by-the-way. But to compute that one row, the naïve implementation builds the entire Q, K, V matrices for all N tokens and runs the whole N×N attention. Most of that work is identical to the work you did the previous step.

GENERATING FOUR TOKENS, NO CACHE Step 1 compute K, V for token 1 K₁ V₁ 1 fresh · 0 wasted Step 2 redo K₁ V₁ · then K₂ V₂ K₁ K₂ V₁ V₂ 2 fresh · 2 wasted Step 3 redo K₁ V₁ K₂ V₂ · then K₃ V₃ K₁ K₂ K₃ V₁ V₂ V₃ 2 fresh · 4 wasted (cumulative) Step 4 redo K₁ V₁ K₂ V₂ K₃ V₃ · then K₄ V₄ K₁ K₂ K₃ K₄ V₁ V₂ V₃ V₄ 2 fresh · 10 wasted (cumulative) For an N-token sequence: total K/V projection ops: 2 · (1 + 2 + 3 + … + N) = N(N+1) of which actually new: 2N wasted recomputation grows as N²: the wall the cache exists to demolish
Generating four tokens, no cache. Every step recomputes K and V projections for every prior token. Past the first step the recomputed cells (greyed in red) outnumber the new ones. By token N you've recomputed the same K and V vectors O(N) times.

The redundancy is staggering. To generate a sequence of N tokens, you’ve recomputed the K and V projections for token 1 roughly N times. For token 2, N-1 times. Total work scales as N² per layer, which is fine for N=100 and catastrophic for N=10,000 and physically impossible for N=200,000.

In practice this means that a 200K-token completion would take longer than a model release cycle, and cost more than a small mortgage. The cache exists because nobody could afford the bill, and nobody could wait for the response.

The insight: K and V of past tokens never change

Look at the attention computation again. The Q, K, V projections are linear: a matrix multiply with the layer’s projection weights, no nonlinearity in the way, no dependence on anything but the token’s own embedding and the layer’s frozen weights. So for any given past token at a given layer, its K vector and its V vector are deterministic. They are computed once and never need to change.

The only thing that does change between generation steps is the new token. Its query, q_new, hasn’t been computed yet. It’s the question “what should I attend to next?” But it doesn’t need the keys and values of past tokens to be recomputed. They’re already known.

The cache is just that recognition, made operational. Per layer, per attention head, store the K and V vectors of every past token, indexed by position. When generating token N+1, compute only q_new (one vector, not N). Look up K and V from the cache (no matmul, just memory load). Run attention from q_new against the cached K and V to get the output. Done.

SAME FOUR STEPS, WITH A KV CACHE KV cache persists across steps: past K and V are stored, not recomputed STEP 1 K₁ fresh STEP 2 K₁ cached K₂ fresh STEP 3 K₁ K₂ K₃ STEP 4 K₁ K₂ K₃ K₄ 3 cached · 1 fresh Per step: only q_new is computed one query vector against the entire cache → one row of attention math, not a square O(N) work, not O(N²) K/V projection ops over N tokens naive: 2 · (1 + 2 + 3 + … + N) = O(N²) cached: 2 · N = O(N) The cache trades a small persistent memory footprint for a quadratic cut in compute. Q is fresh. K and V come from memory. Scoring still reads all N; that part never goes away.
Same four steps, with a cache. Past K and V vectors are stored, not recomputed. Only the new token's Q is fresh. The shape of the work has flipped: each new token costs one row of attention math, not a full N×N matmul.

The shape of the work has flipped. Each new token now costs one fresh projection and one row of attention against the cache, instead of a full re-projection of the past plus a square attention matrix. The cache itself grows linearly: one new K vector and one new V vector per layer per head per token.

This is what makes streaming generation feel instant after the first token. The model isn’t doing less attention. It’s doing exactly as much attention as it ever was. It’s just stopped redoing work it already did.

What the cache does and doesn’t fix

Worth being precise here, because this is where careful readers overcorrect in the other direction. The cache does not make attention constant-time. Generating token N+1 still computes a score against every one of the N cached keys. Per-token attention cost still grows linearly with context length, cache or no cache. Add that up over a full generation and decoding is still quadratic in total. There is no free lunch on context.

What the cache eliminates is recomputation:

  • K/V projection work per step: N tokens re-projected → 1 token projected
  • Attention scoring per step: an N × N square → a single 1 × N row

Summed over a whole N-token generation, projection work drops from O(N²) to O(N), and attention scoring drops from O(N³) to O(N²). A much cheaper quadratic, and one that is paid in memory reads rather than matmuls.

The 1 × N row is the part that never goes away, and it’s exactly why decode is memory-bound: every step must read the entire cache even though it computes almost nothing. Token #50,000 of a response is slower and more expensive than token #50 not because the math got harder but because the cache got bigger. The KV cache turned generation from computationally impossible into bandwidth-expensive. Most of the rest of this guide is about engineering around bandwidth-expensive.

What actually lives in the cache

Per layer, per attention head, per token, the cache stores two vectors of length head_dim: one K vector and one V vector. If you stack the K vectors for every token a head has seen, you get a (seq_len × head_dim) matrix. Same for V. Multiply by number of heads, you have the per-layer cache. Multiply by number of layers, you have the full per-request cache. Multiply by batch size, you have the full GPU-resident cache for a server handling multiple requests at once.

A useful mental model: think of the cache as a 5-dimensional tensor of shape [n_layers, 2, n_kv_heads, seq_len, head_dim], where the “2” is the K-vs-V distinction. Most serving stacks add a sixth dimension for batch, but within one request the first five are what you have.

SHAPE: [ n_layers , 2 , n_kv_heads , seq_len , head_dim ] K V layer L K V layer 1 K_cache[0] V_cache[0] layer 0 num_heads (fixed per model) seq_len → this is the axis that grows head_dim (into the page; fixed per model) One cell of the cache [layer L , K , head h , token t] is a vector of length head_dim: the key (or value) representation of token t at layer L, head h. Fixed: n_layers · n_kv_heads · head_dim come from the model architecture. Growing: seq_len grows by 1 per generated token. Every other axis is fixed.
The KV cache as a stack of per-layer K and V matrices. Inside each layer, every head owns its own (seq_len × head_dim) slab. Sequence length is the only axis that grows during generation; every other axis is fixed by the model architecture.

Some concrete numbers to anchor it. Llama 3 8B has 32 layers, 32 query heads (and 8 K/V heads after grouped-query attention; more on that in a moment), head_dim 128. At a 2,048-token context, batch size 1, fp16, the cache size is:

cache_bytes = 2 (K + V) × 32 layers × 8 KV heads × 2048 tokens × 128 head_dim × 2 bytes ≈ 268 MB

Per request. At 8K context it’s just over 1 GB. At 32K, just over 4 GB. The Llama 3 8B model weights themselves are about 16 GB at fp16. The cache, at 32K context, is a quarter of the model’s footprint per concurrent request. This is why serving long context at scale is a memory problem, not a compute problem.

For comparison, Llama 3 70B (80 layers, 64 query heads / 8 KV heads, head_dim 128) at the same 32K context: 2 × 80 × 8 × 32768 × 128 × 2 ≈ 10.7 GB per request. The 70B model weights are about 140 GB. At 64K context it’s 21 GB: a quarter of an H100’s entire HBM, for a single request. This is where the bills go.

One attention step, worked out

Pick toy numbers small enough to follow in your head: 2 layers, 4 heads, head_dim 8. You are about to generate token #7 of a sequence. The cache for layer 0 currently holds 6 K vectors and 6 V vectors per head: a (4 × 6 × 8) K tensor and an identically shaped V tensor. Same for layer 1.

The flow for generating token #7:

  1. Look up the embedding of the last generated token (token #6). Call it x, a vector of length d_model. For our toy: d_model = head_dim × num_heads = 8 × 4 = 32.
  2. Enter layer 0. Compute three projections: q = x · W_Q, k = x · W_K, v = x · W_V. Each is a vector of length 32. Reshape each into 4 heads of dimension 8.
  3. For each head h ∈ {0, 1, 2, 3}:
    • Append k[h] to K_cache[layer=0, head=h]. The cache is now 7 vectors wide, shape (7 × 8).
    • Append v[h] to V_cache[layer=0, head=h].
    • Compute attention output for the new token (a one-row matmul, not a square one):
      • scores = q[h] · K_cache[0, h]ᵀ / √8 → shape (1 × 7)
      • weights = softmax(scores) → shape (1 × 7)
      • out[h] = weights · V_cache[0, h] → shape (1 × 8)
  4. Concatenate the four head outputs → (1 × 32). Apply the output projection W_O. Add the residual, run the feed-forward block, residual again. This is the layer-0 output for token #7.
  5. Pass that vector up to layer 1 and repeat the whole flow.
  6. After all layers, run the final layer-norm and the unembedding to get logits over the vocabulary. Sample → token #7.
ONE STEP, ONE HEAD · TOY: head_dim = 8, cache holds 6 prior tokens, generating token 7 1. q · Kᵀ / √d → scores q (1 × 8) · fresh · Kᵀ (8 × 7) · cached scores (1 × 7) softmax weights (1 × 7) 2. weights · V → attention output weights (1 × 7) · V (7 × 8) · cached attention output (1 × 8) Per head, per step • 1 vector projected (q) • 1 row matmul vs cache • 1 cache append (k, v) No square matmul. Memory-bound, not compute-bound.
One attention step with the cache. The fresh q vector (1 × 8) hits the cached K matrix (8 × 7) to produce a 1 × 7 score row. Softmax that row. Multiply by cached V (7 × 8). Result: a 1 × 8 vector, the attention output for the new token. Greyed cells came from cache; bold cells were computed this step.

Two things are worth burning into the picture. First, the cache append is a memory operation: there’s no matmul against the past. Second, the attention matmul is now a single row of Q against the entire K cache, not a square N×N matmul. The compute cost of attention per token is linear in sequence length, not quadratic.

The corollary is that decode is memory-bandwidth-bound, not compute-bound. The GPU is mostly reading K and V vectors out of HBM and feeding them into a tiny matmul. The tensor cores sit at single-digit-percent utilization. This is the lever almost every inference optimization pulls on.

The cache grows one token at a time

Animate it in your head. The cache starts empty. Token #1 arrives, three projections happen, K_cache and V_cache each gain one column per head per layer. Token #2 the same, with attention now seeing two K and two V vectors per head. By token N, every layer has accumulated an N-column K matrix and an N-column V matrix per head.

CACHE GROWTH · one layer, one head, head_dim = 8 After token 1 1 × 8 cells After token 2 2 × 8 cells After token 4 4 × 8 cells After token 8 8 × 8 cells Per token, per layer, per head: • 1 K vector appended (8 numbers) • 1 V vector appended (8 numbers) = 16 floats added per head × n_layers × n_kv_heads per token Cache size vs sequence length size tokens 2K 8K 32K 128K Perfectly linear. No compression at this layer. Every token costs the same marginal memory.
The cache after one, two, four, and eight tokens. Same layer, same head. Each new token appends exactly one K column and one V column. The cache footprint scales perfectly linearly with sequence length. There is no batching trick, no compression, no shortcut at this layer of abstraction.

This is the linear-in-time-and-space curve that makes long context tractable. It is also the curve that turns into a hockey stick at scale: every concurrent request keeps its own cache, every cache grows with every token, and HBM has a hard ceiling. The cache is what every serving framework spends its life trying to manage.

Why long context is expensive

The cost of holding a request in memory is the formula we keep referring back to:

cache_bytes = 2 × n_layers × n_kv_heads × seq_len × head_dim × bytes_per_param

For a few real models at typical contexts, fp16:

cache_bytes = 2 · n_layers · n_kv_heads · seq_len · head_dim · bytes_per_param K + V layer count heads after GQA grows w/ tokens per-head width 2 for fp16, 1 for int8 three of these are fixed by the model; one of them is what you choose Real models, fp16, batch size 1 Model Layers KV heads head_dim Context Cache / request Llama 3 8B 32 8 128 8 K ≈ 1.07 GB Llama 3 8B 32 8 128 32 K ≈ 4.29 GB Llama 3 70B 80 8 128 32 K ≈ 10.7 GB Mistral 7B 32 8 128 32 K ≈ 4.29 GB Llama 3 70B (hypothetical MHA) 80 64 128 32 K ≈ 85.9 GB ← entire GPU Cache is per request. Two concurrent users = double the footprint. This is the constraint on serving throughput.
The KV cache memory equation plugged into real models. The cache is per-request; every concurrent user keeps their own. For a long-context model at scale, the cache, not the weights, is what sets your batch size.

A serving box typically holds many concurrent requests in HBM at once. The KV cache, not the model weights, is what determines how many requests you can batch. This is why batching strategy is the central problem in inference serving, and why frameworks like vLLM, TGI, TensorRT-LLM, and SGLang build their architecture around managing this one tensor efficiently.

It also explains a subtle thing about pricing. Output tokens cost more than input tokens, and not (only) because they are “generated.” Each output token holds the entire cache in memory for the duration of its decode step. The marginal cost of generating token #50,000 of a response is much higher than generating token #50, because the cache is much bigger and every step is dominated by memory bandwidth.

Why the first token is slow

The full generation of a response has two phases that look nothing like each other.

Prefill processes the entire prompt at once. The model runs forward through every layer and computes K and V for every prompt token in parallel. This is one big GPU-friendly matmul per layer. It produces the initial cache and the logits for the first generated token.

Decode generates one token at a time. Each step is the small matmul we walked through above, plus a memory read of the entire current cache.

A 50,000-TOKEN PROMPT WITH A 1,000-TOKEN RESPONSE · ON THE SAME TIME AXIS time 0s 2s (TTFT) ~12s (response done) prefill 50,000 tokens, parallel Prefill: compute-bound one big matmul per layer over all prompt tokens at once throughput ~ 25,000 tokens / sec decode 1,000 tokens, sequential, one at a time Decode: memory-bound tiny matmul each step, dominated by HBM reads of the cache throughput ~ 100 tokens / sec ~2s (TTFT) ~10s (decode) Same model, same GPUs, same request. Two completely different bottlenecks back to back.
Prefill vs decode, on the same horizontal time axis. A 50,000-token prompt prefills in roughly the time it takes to decode ten tokens. Prefill is compute-bound and parallel; decode is memory-bound and inherently sequential.

The asymmetry is brutal. A 50K-token prompt with a 1K-token response: prefill processes 50,000 tokens in two or three seconds; decode generates 1,000 tokens, each one bottlenecked by HBM bandwidth, in maybe ten seconds. Total wall clock: thirteen seconds. The first token sits behind the prefill. That is the “time to first token” you see in spec sheets. After that, decode is what you feel.

This asymmetry is also what prompt caching exploits. If the prefill is the same prefill, the work to build the cache is identical, so save the cache for next time.

Sharing K and V across heads: MHA, MQA, GQA

The single most effective architectural change to the KV cache in the last few years is not changing how the cache works. It’s making it smaller.

Classical multi-head attention (MHA) gives every query head its own K and V. Llama 1 with 32 query heads had 32 K heads and 32 V heads. That is a 32× multiplier on cache size that nobody is excited about paying.

Multi-query attention (MQA) shares one K and one V across all query heads. Cache shrinks by 32×. Quality drops noticeably; the model loses some of its capacity for fine-grained attention patterns.

Grouped-query attention (GQA) is the negotiated middle. Divide the query heads into G groups, each group shares one K and one V. Llama 3 uses 32 query heads in 8 groups of 4: a 4× cache reduction with quality within noise of full MHA.

MHA Multi-Head Attention one K, V per query head Q × 8 K × 8 V × 8 MQA Multi-Query Attention all queries share one K, V Q × 8 K × 1 V × 1 GQA Grouped-Query Attention 2 groups of 4 queries share K, V Q × 8 K × 2 V × 2 Relative KV cache size (× = MHA baseline) 1.00 × 0.13 × (1/8) 0.25 × (1/4) Llama 1, GPT-3 most pre-2023 models PaLM, Falcon aggressive, slight quality cost Llama 3, Mistral, Qwen, Gemma, DeepSeek · default Same Q heads. The lever is how many K and V you keep.
MHA, MQA, and GQA, side by side. The shaded blocks below each panel show relative KV cache size: MHA is the baseline, MQA is the most aggressive shrink, GQA strikes the production-default balance. Every modern open model is on the right side of this diagram.

Every modern open model (Llama 3, Mistral, Mixtral, Qwen, Gemma, DeepSeek) uses GQA. It is the cheapest architectural lever that exists for the KV cache, and it is the first thing to look at when you are sizing a deployment. The cost is a single line in the model config; the saving compounds across every byte of cache you allocate.

Paged attention: the cache as virtual memory

The naïve cache layout (one contiguous KV tensor per request) has the same problem early operating systems had with contiguous memory allocation. You cannot reuse fragments, you cannot share memory between processes, and one request’s footprint blocks others from fitting.

The vLLM team’s contribution was to treat the KV cache the way an operating system treats RAM. Slice the cache into fixed-size blocks: typically 16 tokens per block, per layer, per head. Maintain a block table per request that maps logical sequence position to physical blocks. The physical blocks live in a global pool managed by the server.

PAGED ATTENTION · KV CACHE LAID OUT LIKE VIRTUAL MEMORY LOGICAL SEQUENCE BLOCK TABLE PHYSICAL PAGES Request 1 "You are a helpful…" b0 b1 b2 b3 logical → physical b0 page 12 b1 page 13 b2 page 7 b3 page 22 Request 2 "You are a helpful…" b0 b1 b2 logical → physical b0 page 12 b1 page 13 b2 page 41 global physical page pool 12 13 7 22 41 · · · pages allocated on demand from a free list Both requests share the same physical pages for the system prompt prefix. No copy. No reprefill. The cache is now an address space, not a tensor.
Paged attention. Two requests on the left, each with a logical sequence of tokens. Block tables in the middle map logical positions to physical pages on the right. Two requests that share a prefix can share the same physical pages: copy-on-write semantics applied to attention state.

The wins compound. Memory fragmentation drops to a few percent. Two requests that share a prefix (the same system prompt, for instance) can literally share the same physical pages. Sequence-level parallelism, beam search, and parallel sampling all become memory-efficient because branches can share their common past.

The original paper claimed a 2-4× throughput improvement over previous serving stacks. The principle has since shown up everywhere: vLLM, TensorRT-LLM, SGLang, llamacpp’s continuous-batching mode, and most production inference frameworks now run on some variant of it. The KV cache stopped being a tensor and became an address space.

Prompt caching: the cache that outlives the request

In most APIs, the KV cache lives and dies with one request. The compute that built it (the prefill) gets thrown away the moment the response returns. The next request, even if it shares a 99% identical prefix, redoes that prefill from scratch.

Prompt caching breaks the throwaway. Anthropic, OpenAI, Google, and several inference vendors now persist the KV cache for the prefix of a request, keyed by the exact prefix bytes, for a short TTL; Anthropic’s default is around five minutes, with a longer extended-cache option. The next request that arrives with a matching prefix loads the cache directly into HBM instead of re-running prefill on those tokens.

PROMPT CACHING · THE KV CACHE THAT OUTLIVES ITS REQUEST Request 1 (t = 0) system prompt + tools + RAG context · 50,000 tokens user msg prefill: process all 50K tokens → KV cache for prefix saved on server · TTL ≈ 5 minutes Request 2 (t = 90s) same 50,000 tokens · byte-for-byte identical prefix new user msg cache HIT: load prefix cache, skip prefill on 50K tokens Bill comparison (Anthropic-style pricing) Without prompt caching: 50,000 input tokens × $1.00 / 1M = $0.0500 With prompt caching (Req 2): 50,000 cached × $0.10 / 1M = $0.0050 10× cheaper
Prompt caching. Request 1 prefills a long system prompt; the resulting KV cache is persisted on the server. Request 2 arrives minutes later with the same prefix; the cache loads back into HBM and decode starts immediately. The dollar amounts at the bottom are an honest reflection of how the bill changes.

The economic effect is large. For a long-system-prompt agentic workflow, the cache hit ratio is typically >95% on every turn after the first. Anthropic charges roughly 1/10 of normal input price for cached input. A loop that would cost a dollar per turn at full pricing costs about ten cents. The latency savings are comparable: time-to-first-token drops from “however long it takes to prefill 50K tokens” to “however long it takes to load the cache from a fast memory tier.” A tenfold improvement is normal.

There are real failure modes. The cache key is the exact prefix, byte-for-byte. Add an extra space, switch “USA” to “U.S.A”, change the date in the system prompt by one day: cache miss, full reprefill, normal price. Production teams crystallize their system prompts and pass anything dynamic as a later message in the request, after the cached boundary.

This is also the optimization that is most often left on the table. If your application is sending a long system prompt or a large RAG context on every request, you are almost certainly doing redundant prefill on most of every request. Adding the cache_control: { type: "ephemeral" } annotation on Anthropic, or the equivalent flag on the OpenAI Responses API, typically takes an afternoon and saves 70% or more on input-token spend.

Designing prompts for cache hits

Once caching is on, prompt engineering and cache engineering become the same discipline. The key is the byte-identical prefix, so the design rule is simple: static content first, dynamic content last, cache boundary between them. The classic self-inflicted miss is interpolating the clock into the top of the system prompt:

# Bad: the date changes, the bytes change, the cache misses every day
You are a support agent for Acme.
Today's date: 2026-06-15
[...8,000 more tokens of instructions and tool definitions...]
# Good: the prefix never changes; the date rides in with the first user turn
You are a support agent for Acme.
The current date is provided in the first user message.
[...8,000 more tokens of instructions and tool definitions...]

The same rule generalizes well past the date:

  • Serialize tool definitions deterministically. A JSON encoder that reorders keys between deploys silently invalidates every cached prefix in production. I learned this one from a client’s bill, not from documentation.
  • RAG retrievals go after the static boundary. They change per request; don’t let them poison the shared prefix. Anthropic allows up to four cache breakpoints, so a long agent loop can cache the system prompt and tools at one boundary and the accumulated conversation at another.
  • Shared scaffolding before per-user content. If every tenant runs on the same instructions, keep those bytes identical across users and put user-specific memory after them; the expensive shared prefix then caches across your whole user base.
  • Treat cache hit rate as an SLO. The API response reports how many input tokens were read from cache. Log it. A refactor that quietly drops your hit rate from 95% to zero looks like nothing in code review and shows up as a 10× input bill at the end of the month.

Sliding window with attention sinks

What if you don’t want the cache to grow forever? Mistral models, and a few others, cap attention to a sliding window: the last W tokens, drop the rest. The cache stays bounded at W. Generation can in principle continue forever.

The problem the StreamingLLM paper surfaced is what happens at the edge of the window. The very first tokens (the absolute oldest few) turn out to have a structural role. The model has learned to dump excess attention onto them: “attention sinks.” Throw them out, and the softmax distribution collapses, and generation quality degrades sharply within a few hundred steps.

The fix is simple: keep the first 4 tokens of the cache always, plus the last W tokens. The cache stays bounded; the sinks stay in place; the model keeps generating coherently for arbitrary sequence lengths.

SLIDING WINDOW + ATTENTION SINKS · the cache stops growing Long sequence (e.g. 16,000 tokens), window W = 4,096 t₁ t₂ t₃ t₄ sinks (pinned) evicted · outside the window last W tokens (window) current current token attends to sinks + window only; everything between is gone Cache size vs sequence length size tokens naive (linear, unbounded) sliding window: bounded at W W
Sliding window with attention sinks. The first four tokens are pinned in the cache (the sinks). The last W tokens (the window) slide forward. Everything in between is evicted. The cache footprint stays flat as the sequence grows past the window.

This is the closest thing the field has to “infinite context” right now, and it shows up in different forms in Mistral, Gemini, and many open models. The constraint it accepts is honest: the model can’t actually attend to anything beyond the window. The sinks aren’t a memory of what happened; they’re a softmax stabilizer. Anything important from outside the window has to be re-injected by the application as part of the input.

Where the field is going

Everything above is the settled layer: what production serving already looks like. The frontier is still circling the same object, and the easiest way to keep your bearings in the paper flood is to see every new technique as pulling one of six levers on the same constraint.

THE CONSTRAINT the cache is huge · it lives in HBM · every decode step reads all of it SHRINK IT GQA · ¼ the KV heads FP8 / KIVI 2-bit quantization eviction: H2O, SnapKV fewer bytes per token SHARE IT paged attention prefix sharing across requests RadixAttention routing same bytes, many requests REUSE IT prompt caching prefix persistence with a TTL, keyed on exact bytes pay prefill once BOUND IT sliding window attention + pinned attention sinks stop the growth entirely MOVE IT offload to CPU RAM / NVMe disaggregated prefill-decode (DistServe, Mooncake) right bytes, right tier AMORTIZE READS speculative decoding Medusa-style extra heads more tokens per full read Six levers, one object. Every serving paper you'll read this year pulls at least one of them.
Six levers on one object. Four of them you've already met in this guide (GQA, paging, prompt caching, sinks). Every inference optimization you'll read about this year (speculative decoding, cache quantization, RadixAttention, disaggregated serving) pulls at least one of these.

Amortize the read: speculative decoding and Medusa. Every decode step reads the whole cache to emit one token. Speculative decoding has a small draft model propose several tokens cheaply, which the big model then verifies in a single forward pass. One full-bandwidth read of the cache buys up to k tokens instead of one. Medusa gets a similar effect without a separate draft model, by training extra decoding heads that guess several tokens ahead. In a memory-bound regime, doing more arithmetic per byte read is nearly free, which is why this works at all.

Shrink the bytes: KV quantization and eviction. bytes_per_param is a straight multiplier in the memory formula, and the cache tolerates low precision surprisingly well. FP8 caches are routine in TensorRT-LLM and vLLM; KIVI pushes to 2-bit with modest quality cost. A separate line of work (H2O, SnapKV) watches attention patterns and evicts cache entries that have stopped receiving attention mass. Either approach can halve the footprint (combined, better) without touching the model weights.

Schedule for sharing: prefix-aware routing. Paged attention made prefix sharing possible; the next step is making it likely. SGLang’s RadixAttention keeps a radix tree over every cached prefix on a replica and routes incoming requests toward the machine that already holds their prefix. Cache hit rate stops being a happy accident and becomes a scheduling objective.

Split the phases: disaggregated serving. Prefill is compute-bound; decode is memory-bound. Running both on the same GPU means sizing the hardware for one and wasting it on the other. Disaggregated architectures (DistServe, Mooncake) run prefill and decode on separate GPU pools, each sized for its own bottleneck, and ship the KV cache between them over the interconnect. Moving gigabytes of cache between machines sounds absurd and benchmarks fine; the interconnects got fast enough.

Tier the storage: KV offloading. A parked agent session doesn’t need its cache sitting in HBM. Offloading systems tier the cache the way an operating system tiers memory: HBM while the request is active, CPU RAM when it idles, NVMe when it parks, reload on resume. This is the piece that makes day-long agent sessions economical rather than ruinous.

None of these is exotic anymore. If you’re choosing a serving stack in 2026, you are mostly choosing between implementations of this list.

What the KV cache is not

Half the productive work of this guide is fencing off the things people confuse with the KV cache.

THREE THINGS PEOPLE CONFUSE · none of them is the others Long-term memory outside the model Context window the input you send this turn KV cache inside the model, this request EXAMPLES • Vector DB (Pinecone, Weaviate) • Knowledge graph (Neo4j, Zep) • Conversation log + summaries • User profiles, fact stores LIFETIME persistent across sessions OWNED BY you, the application engineer not what the model "knows" EXAMPLES • System prompt • Tool/function definitions • Conversation history • Retrieved RAG chunks • The current user message LIFETIME one turn: what you sent OWNED BY you (the request payload) EXAMPLES • K and V vectors per layer • Per attention head • Per token in the request • In GPU HBM LIFETIME request only (or short TTL with prompt caching) OWNED BY the inference server, not you retrieve / paste in prefill builds the cache Data flows left to right. Nothing flows back: the model does not learn from the conversation.
Three things people confuse. The KV cache lives inside the model, for one request, ephemeral. The context window is the input you send this turn: what the model sees right now. Long-term memory (RAG, vector DBs, summarization) is engineering outside the model, feeding the context window. Long-term memory flows into context, context flows into cache. Never the reverse.
  • It is not “the model’s memory.” The model’s parameters are frozen, set in training. The KV cache is intermediate state for one in-flight request, not retained knowledge.
  • It does not persist past the request, except in prompt caching, and even then only for a short prefix-keyed window.
  • It is not what RAG fills. RAG retrieves text and appends it to the prompt. That text then participates in prefill and ends up in the cache like any other tokens. The cache is downstream of context-window engineering, not a replacement for it.
  • It is not what makes agents “remember” across sessions. That is external state (databases, vector stores, summarization), covered in A Visual Guide to AI Agent Memory.
  • It does not “learn” from the conversation. Nothing in the KV cache feeds back into the weights. The fact that the model “knows” things from earlier in the conversation is because those tokens are still in the cache. Once they’re not, they’re gone.

The KV cache is mechanism, not memory. It is the data structure that makes attention tractable during decoding. Everything we colloquially call “memory” lives elsewhere, in engineering you build around the model.

Why this matters in your code

If you take three operational things from this guide, take these.

Your bill grows with the cache, not just with token count. The marginal cost of generating token #50,000 of a response is much higher than generating token #50: same number of “output tokens” on the invoice, but vastly more memory bandwidth consumed. Output token pricing is averaged across this, which makes it look linear when the underlying cost isn’t. If your application generates long responses, your effective cost per token climbs with response length, and your tail latencies climb with it.

Prompt caching is the highest-ROI optimization most teams haven’t done. If your application sends a long system prompt or large context (RAG retrievals, multi-turn agent state) on every request, you are very likely doing redundant prefill on most of every request. Turning on the cache header typically takes a single afternoon and saves 70%+ on input-token spend. The reason it isn’t on by default is that you have to be deliberate about what your prefix is. Once you are, it’s almost free money.

Long context isn’t “free” just because the model accepts it. A 1M-token Claude or Gemini request has a KV cache in the multiple-gigabyte range. Time-to-first-token reflects the prefill of that whole prompt. Decode rate reflects HBM bandwidth pulling those gigabytes per token. Long context is not a magic substitute for retrieval; it pushes the cost from your vector DB to the model’s HBM, and the model’s HBM is more expensive.

A back-of-the-envelope for your model:

  1. Pick your model’s n_layers, n_kv_heads, head_dim. These are in every model card.
  2. Multiply: 2 × n_layers × n_kv_heads × head_dim × seq_len × 2 bytes (for fp16) → cache size in bytes.
  3. Compare to your serving GPU’s HBM (typically 40-80 GB), minus weights, minus framework overhead.
  4. That’s a rough upper bound on how many concurrent requests you can fit, assuming each runs at full context.

If you find yourself running out of headroom, the levers, in order of effort: turn on prompt caching, switch to a model with GQA if you somehow aren’t on one, reduce max context, deploy on a paged serving stack (vLLM / SGLang / TensorRT-LLM). The improvements roughly multiply.

A short opinion

The next eighteen months of inference work will be more about the KV cache than about model architecture. The architectures have mostly stabilized: decoder-only transformer, RoPE, GQA, RMSNorm, MoE in some lines, dense in others. The cache layout has not. Paged attention was 2023. Prompt caching was 2024. Every direction in the section above is still being measured in 2× and 3× wins over the previous baseline, in a field where architecture papers fight over half a percent. Nothing else in the stack is moving that fast.

If you want to be the person on your team who actually understands the production cost curve of your LLM stack, internalize this data structure. It is the single most useful piece of LLM-systems knowledge you can hold. Everything else (better serving, lower latency, lower bills, longer contexts) is a refinement on top of it.

References and further reading

  • Vaswani et al. (2017). Attention Is All You Need. The original paper that made every later optimization possible. The KV cache isn’t named here; it is an inference-side trick that emerged later.
  • Shazeer (2019). Fast Transformer Decoding: One Write-Head is All You Need. The original MQA paper.
  • Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. The architectural compromise that every modern open model uses.
  • Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. The vLLM paper. Required reading if you serve LLMs.
  • Xiao et al. (2024). Efficient Streaming Language Models with Attention Sinks. StreamingLLM and the discovery of the attention-sink phenomenon.
  • Pope et al. (2022). Efficiently Scaling Transformer Inference. The canonical reference for inference-time math, prefill vs decode, and the memory wall.
  • Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. One full cache read, k tokens out.
  • Cai et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Speculation without a draft model.
  • Zheng et al. (2024). SGLang: Efficient Execution of Structured Language Model Programs. RadixAttention and prefix-aware scheduling.
  • Liu et al. (2024). KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. How far the cache can be compressed before quality breaks.
  • Zhang et al. (2023). H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Attention-guided cache eviction.
  • Zhong et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. The case for splitting the two phases onto separate hardware.
  • Anthropic prompt-caching documentation. Current best reference for the API-level mechanics.
  • Sasha Rush’s The Annotated Transformer. For the Q/K/V math in PyTorch you can run line by line.

Get the next guide when it ships

Plus field notes from production AI, roughly twice a month.