Why RadixAttention Beats Chunked Prefill for Multi-Turn RAG (And When It Doesn’t)

I spent three weeks debugging why our inference latency spiked 400% during a traffic surge, only to realize we were recalculating the exact same 4,000-token system prompt for every single user. It felt like watching a gold-medalist runner stop to tie their shoes every ten meters. We were burning H100 hours on redundant math that should have been cached, and if you’re still serving long-context RAG apps without a specialized attention management strategy, you’re likely setting your compute budget on fire too.
TL;DR / Quick Takes
- RadixAttention is a big step forward in KV cache reuse. It organizes the cache like a file system (a radix tree), allowing multiple requests to share the same prefix (like a 3,000-token system prompt or a PDF context) without recomputing it.
- Chunked Prefill solves the "head-of-line blocking" problem. Instead of making every in-flight generation wait while a long prompt is prefilled in one shot, it breaks the prompt into smaller bites, so prefill chunks and decode steps share the same batch.
- The Winner: If your workload is multi-turn chat or RAG with a fixed knowledge base, RadixAttention (via SGLang) is your best bet. If you have massive, one-off prompts that are unique to every user, Chunked Prefill (via vLLM) will keep your latency smoother.
- Reality Check: You rarely have to pick just one; recent vLLM and SGLang releases are converging on combining prefix caching with chunked prefill.
The Bottleneck: Why TTFT is the Only Metric That Matters
When a user hits "Submit" on a prompt, they don't care about your total throughput. They care about how long it takes for that first word to appear on the screen. This is Time-to-First-Token (TTFT).
In production, TTFT is dominated by the prefill phase. This is where the LLM processes your input tokens and builds the "Key-Value (KV) cache." Think of the KV cache like a notebook — the model writes down everything it has learned about the current conversation so it doesn't have to re-read the entire book every time it generates a new word.
The problem? For a 10,000-token prompt, that "notebook" is massive. Calculating it takes a lot of compute. If ten users send a 10,000-token prompt at the same time, your GPU stalls. This is where we see the battle between RadixAttention and Chunked Prefill.
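To put a rough number on that "notebook", here is a back-of-the-envelope sketch. It assumes a Llama-3-8B-style configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache; the function name `kv_cache_bytes` is purely illustrative, and other models will give different figures.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV cache size for num_tokens tokens.

    Each token stores one K and one V vector per layer and per KV head.
    Defaults are an assumed Llama-3-8B-style config with an fp16 cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


one_prompt = kv_cache_bytes(10_000)
print(f"per token:            {kv_cache_bytes(1) / 1024:.0f} KiB")
print(f"one 10k-token prompt: {one_prompt / 2**30:.2f} GiB")
print(f"ten such prompts:     {10 * one_prompt / 2**30:.2f} GiB")
```

Under these assumptions a single 10,000-token prompt costs a bit over 1 GiB of cache, and every one of those entries has to be computed during prefill before the first token can be emitted.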
RadixAttention: The "File System" for Your GPU Memory
RadixAttention, popularized by the SGLang project, is honestly a bit of a "why didn't we do this sooner?" moment. Historically, KV caches were handled linearly. You have a request, you fill the cache, you delete it when the request is done. Maybe you have a basic LRU (Least Recently Used) cache for the whole prompt, but if the user changes even one word at the end of a long prompt, most caches would bust and start over.
RadixAttention changes the KV cache from a list into a tree (a Radix Trie).
Imagine you have a system prompt: "You are a helpful assistant that specializes in legal contracts..." Followed by User A asking: "Summarize this NDA." And User B asking: "What is the termination clause in this NDA?"
In a standard setup, you'd calculate the "You are a helpful assistant..." part twice. With RadixAttention, that common prefix is stored as a node in the tree. User A and User B both "point" to that node. If User A comes back for a second turn, their previous conversation history is already a node in the tree. We just append the new tokens.
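The sharing logic can be sketched with a toy prefix tree. This is a simplification, not SGLang's implementation: it stores one whitespace-split "token" per node, whereas a real radix tree compresses runs of tokens into single edges and each node references actual KV tensors. The names `PrefixNode`, `PrefixCache`, and `match_and_insert` are made up for illustration.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token -> PrefixNode


class PrefixCache:
    """Toy token-level trie standing in for a radix tree of cached KV entries."""

    def __init__(self):
        self.root = PrefixNode()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, caching the rest."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Cache miss: from here on, every token needs a fresh prefill.
                child = node.children[tok] = PrefixNode()
            else:
                matched += 1
            node = child
        return matched


system = "You are a helpful assistant that specializes in legal contracts".split()
user_a = system + "Summarize this NDA".split()
user_b = system + "What is the termination clause in this NDA?".split()

cache = PrefixCache()
hits_a = cache.match_and_insert(user_a)    # cold cache, nothing to reuse
hits_b = cache.match_and_insert(user_b)    # reuses the whole system prompt
hits_a2 = cache.match_and_insert(user_a + "Now list the parties".split())

print(hits_a, hits_b, hits_a2)
```

User B reuses every system-prompt token, and User A's second turn reuses the entire first turn, so only the new question has to be prefilled.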
How it looks in SGLang
SGLang implements this under the hood, but you can see the impact when you structure your code to reuse prefixes.
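# SGLang frontend sketch of prefix sharing
# (needs a backend configured via sgl.set_default_backend(...) before it runs)
import sglang as sgl

@sgl.function
def legal_assistant(s, contract_text, question):
    # Shared prefix: identical across calls with the same contract_text,
    # so it becomes a cached node in the Radix tree.
    s += "You are a legal expert. Analyze the following text: " + contract_text
    # Unique suffix: this part is the leaf node and needs a fresh prefill.
    s += "\nQuestion: " + question + "\nAnswer:"
    # Generate the answer; the max_tokens value is an arbitrary illustration.
    s += sgl.gen("answer", max_tokens=256)

# When calling this multiple times with the same contract_text,
# SGLang hits the Radix cache for the first part.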
⚠️ Gotcha: RadixAttention depends heavily on memory headroom. If the model weights already fill most of the card, the tree will constantly evict nodes to make room for active generations, so cached prefixes rarely survive long enough to be reused. You need enough spare VRAM for the cache to actually be useful. I've seen teams try to use RadixAttention on 80GB A100s with models that take up 75GB, and then wonder why their cache hit rate is stuck near 0%.
Chunked Prefill: Ending the "Bully" Problem
While RadixAttention focuses on avoiding work, Chunked Prefill focuses on scheduling work better.
In older vLLM releases without chunked prefill, if a new request with a 4,000-token prompt came in while the GPU was busy generating tokens for 50 other users, the scheduler would pause those decodes and run the entire prefill first. The mismatch hurts because a prefill is compute-bound and pushes thousands of tokens through the model in one long step, while a decode step is memory-bound and produces just one token per request.
The prefill "bullies" the decodes, leading to massive spikes in Inter-Token Latency (ITL) for everyone else.
Chunked Prefill breaks that 4,000-token prompt into chunks (say, 512 tokens each). The scheduler then mixes these chunks with the decoding steps of other users.
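Here is a minimal sketch of that scheduling idea, assuming a fixed per-iteration token budget and a decode-first policy (every running request gets its one decode token, and whatever budget is left goes to the next slice of the pending prompt). The function name and the returned dictionaries are illustrative; real schedulers also track KV memory, multiple waiting prompts, and preemption.

```python
def schedule_iterations(prompt_len, num_decoding, token_budget):
    """Plan per-iteration batches: decodes first (1 token each), then a prefill chunk."""
    if num_decoding >= token_budget:
        raise ValueError("token_budget must exceed the number of decoding requests")
    plan = []
    remaining = prompt_len
    while remaining > 0:
        # Budget left after every running request gets its single decode token.
        chunk = min(remaining, token_budget - num_decoding)
        plan.append({"decode_tokens": num_decoding, "prefill_tokens": chunk})
        remaining -= chunk
    return plan


# 4,000-token prompt arriving while 50 users are decoding; a budget of 562
# leaves room for 512-token prefill chunks.
plan = schedule_iterations(prompt_len=4000, num_decoding=50, token_budget=562)
for step, batch in enumerate(plan):
    print(step, batch)
```

The 50 decoding users keep receiving a token on every iteration while the long prompt is absorbed over several steps, instead of stalling for one giant prefill.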
| Feature | RadixAttention | Chunked Prefill |
|---|---|---|
| Primary Goal | Minimize redundant computation | Stabilize latency and increase throughput |
| Best For | Multi-turn chat, repetitive RAG prompts | High-concurrency, variable prompt lengths |
| Tooling | SGLang | vLLM, TensorRT-LLM |
| Mechanism | LRU-based Radix Trie for KV Cache | Sliced prefill tasks in the scheduler |
To enable chunked prefill in vLLM, it’s usually a simple flag:
python -m vllm.entrypoints.openai.api_server \
    --model nm-testing/Meta-Llama-3-70B-Instruct-AWQ \
    --enable-chunked-prefill \
    --max-num-batched-tokens 2048
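Note: --max-num-batched-tokens is the key knob here. It caps how many tokens (prefill chunks plus decode tokens) are processed in one iteration, so it effectively sets the chunk size: smaller values favor inter-token latency, larger values favor TTFT and throughput.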
What I’d Actually Use in Production
Look, I'll be honest — if you’re building a "Chat with your PDF" app, RadixAttention is non-negotiable.
When a user asks five questions about the same 20-page document, RadixAttention ensures that the 20-page document is only processed once. For turns 2 through 5, the TTFT will be near-instant because the model only has to "prefill" the 20 new tokens of the user's question. This is a massive win over optimizing RAG pipelines with hybrid search alone; it’s an architectural win at the inference layer.
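The arithmetic behind that claim is easy to sketch. The numbers below are assumptions for illustration only: a 10,000-token document, 20-token questions, and 200-token answers. The function name is hypothetical.

```python
def prefill_tokens_per_turn(doc_tokens, question_tokens, answer_tokens,
                            turns, prefix_cached):
    """Tokens that must be prefilled on each turn of a document chat."""
    context_len = doc_tokens  # conversation length before the next question
    cached_len = 0            # tokens whose KV entries are already cached
    per_turn = []
    for _ in range(turns):
        prompt_len = context_len + question_tokens
        reused = cached_len if prefix_cached else 0
        per_turn.append(prompt_len - reused)
        # After the turn, the prompt and the generated answer are both in the cache.
        context_len = prompt_len + answer_tokens
        cached_len = context_len
    return per_turn


without_cache = prefill_tokens_per_turn(10_000, 20, 200, turns=5, prefix_cached=False)
with_cache = prefill_tokens_per_turn(10_000, 20, 200, turns=5, prefix_cached=True)
print("no reuse:    ", without_cache, "total", sum(without_cache))
print("prefix reuse:", with_cache, "total", sum(with_cache))
```

With these assumed sizes, five turns cost 52,300 prefill tokens without reuse versus 10,100 with it, and every turn after the first prefills only the 20-token question.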
However, if you are running a generic API where every prompt is a different snippet of code or a different customer email, RadixAttention won't help you much because there’s no prefix to share. In that case, Chunked Prefill is your savior because it prevents one jerk with a 32k context window from ruining the latency for the other 100 users asking for 10-token summaries.
For most of my recent builds, I’ve been leaning toward SGLang because it handles the Radix tree management more elegantly than vLLM's current prefix caching implementation. But vLLM is catching up fast, especially with their new V1 engine.
The Part Nobody Tells You: The Cache Eviction Nightmare
Here is the part where things get messy. With RadixAttention, KV cache entries persist after a request finishes so that later requests can reuse them. In a high-traffic environment, your VRAM will fill up.
When the VRAM is full, the engine has to decide which "branch" of the Radix tree to prune. Most engines use an LRU-style policy, evicting the least recently used leaves first. But what if the "prefix" you're evicting is a very expensive-to-calculate GraphRAG context?
If your eviction policy is too aggressive, you end up in a "cache thrashing" cycle. You delete a prefix to make room for a new request, then the very next request needs that deleted prefix. This causes your TTFT to fluctuate wildly — sometimes it's 20ms, sometimes it's 2,000ms.
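The thrashing pattern is easy to reproduce in a toy simulation. This sketch assumes a plain LRU cache that holds whole prefixes in equal-sized slots, which is much coarser than a real engine (those evict tree nodes and count capacity in tokens), but the failure mode is the same. The prefix names and capacities are made up.

```python
from collections import OrderedDict


def hit_rate(requests, capacity):
    """Fraction of requests whose prefix is still resident in an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for prefix_id in requests:
        if prefix_id in cache:
            hits += 1
            cache.move_to_end(prefix_id)   # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used prefix
            cache[prefix_id] = True
    return hits / len(requests)


# 100 requests cycling over four document prefixes.
requests = ["doc_a", "doc_b", "doc_c", "doc_d"] * 25

thrashing = hit_rate(requests, capacity=3)  # room for three prefixes
healthy = hit_rate(requests, capacity=4)    # room for all four
print(f"capacity 3: {thrashing:.0%} hits, capacity 4: {healthy:.0%} hits")
```

Being one slot short takes the hit rate from 96% to 0% in this access pattern, which is why a small amount of extra KV headroom can matter far more than its size suggests.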
Pro-tip: Monitor your cache_hit_rate religiously. If it's below 40% in a multi-turn scenario, you likely need to:
- Increase your GPU memory (move from A10G to A100/H100).
- Quantize your model to 4-bit or 8-bit to free up VRAM for the KV cache.
- Aggressively consolidate your system prompts so they are exactly identical across requests.
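On the last point, prefix reuse only covers tokens up to the first difference, so where you put per-request data matters as much as what it is. A small sketch, using whitespace-split words as a stand-in for real tokenizer output and a made-up `[user=...]` tag:

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a legal expert. Analyze the following text:".split()

# Variable data first: the prompts diverge at token 0, so nothing is shareable.
bad = shared_prefix_len(["[user=alice]"] + SYSTEM, ["[user=bob]"] + SYSTEM)

# Variable data last: the entire system prompt is a shareable prefix.
good = shared_prefix_len(SYSTEM + ["[user=alice]"], SYSTEM + ["[user=bob]"])

print(f"variable-first shares {bad} tokens, variable-last shares {good}")
```

The same applies to timestamps, request IDs, or a stray trailing space in one copy of the system prompt: anything that differs early in the sequence cuts off reuse for everything after it.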
Practical FAQ
Q: Does RadixAttention work with Speculative Decoding? Yes, and it’s actually a killer combo. RadixAttention handles the "past" (the prompt), while Speculative Decoding handles the "future" (predicting the next tokens). Together, they can cut end-to-end latency by as much as 3x on favorable workloads, though the gain depends heavily on how much prefix reuse you get and how often the draft tokens are accepted.
Q: Can I use Chunked Prefill for Llama 3 70B on a single A100? Yes, provided the model is quantized enough to fit (for example, a 4-bit AWQ build like the one in the command above), but be careful. Chunking increases the total time the GPU spends on prefill, because each chunk is a separate pass that has to re-read the KV entries of the chunks before it. You’ll get better consistency in your latency, but your total throughput might take a 5-10% hit.
Q: Is RadixAttention relevant for Small Language Models (SLMs)? Absolutely. In fact, it’s even more powerful for Edge AI and SLMs because these models often have smaller context windows and limited VRAM. Keeping that cache alive is the only way to make them feel "snappy."
Q: How does this interact with Multi-Agent Orchestration? This is huge. In multi-agent workflows, agents often share a massive "world state" or "instruction set." RadixAttention allows all 10 agents to share that same state in VRAM, rather than each agent having its own copy of the instructions.
What to Try Next
If you’re currently using transformers or a basic FastAPI wrapper for inference, stop. Move to vLLM if you have a high volume of diverse, single-turn requests. Move to SGLang if you are building complex, agentic, or multi-turn RAG systems.
The next frontier isn't just serving; it's how we handle the memory for long-context retrieval in legal contracts and other high-stakes domains. If you haven't yet, look into how prefix caching changes your cost-per-token — you might find you can afford a much larger model once you stop paying the "prefill tax" on every turn.
Gulshan Sharma
AI/ML Engineer, Full-Stack Developer
AI engineer and technical writer passionate about making artificial intelligence accessible. Building tools and sharing knowledge at the intersection of ML engineering and practical software development.