
Moving Beyond PagedAttention: Why RadixAttention is the New Standard for Production LLM Serving

CyberInsist
Published on April 25, 2026


Quick Summary

If you are optimizing LLM inference, you know that the KV Cache is your primary bottleneck. While PagedAttention (pioneered by vLLM) solved the problem of external memory fragmentation by allowing non-contiguous memory allocation, it struggles with efficient, automatic prefix sharing across multiple independent requests. RadixAttention (introduced by SGLang) evolves this by treating the KV cache as a dynamic tree structure. This allows for near-instantaneous reuse of system prompts, RAG context, and multi-turn chat history without manual management, significantly reducing Time-To-First-Token (TTFT) and increasing total throughput in complex workflows.


If you’re running Large Language Models (LLMs) in production today, you aren't just an AI engineer; you’re a high-performance memory manager. We’ve moved past the "can it run?" phase and are now firmly in the "how many requests per second can we squeeze out of an H100?" phase.

The single biggest obstacle to scaling inference isn't the FLOPs; it's the memory occupied by the Key-Value (KV) cache. When we process a prompt, we store the keys and values of the attention layers to avoid recomputing them while generating every subsequent token. As you might have already realized (see What Are Large Language Models), LLMs are essentially stateful engines, and that "state" (the KV cache) grows linearly with context length.

In this deep dive, I’m going to break down why the industry is shifting from the block-based management of PagedAttention to the tree-based management of RadixAttention. If you are building multi-agent workflows or RAG pipelines, this distinction will determine whether your infrastructure costs remain linear or scale sub-linearly with your user base.

The PagedAttention Paradigm: Solving Fragmentation

To understand the leap to RadixAttention, we have to respect the foundation. Before PagedAttention, we allocated memory for the KV cache contiguously. If you had a 2048-token context limit, you reserved space for 2048 tokens upfront. If the request only used 10 tokens, the other 2038 were "zombie" slots—reserved but useless. This is classic internal fragmentation.

PagedAttention borrowed a page from Operating Systems 101. It breaks the KV cache into fixed-size blocks (typically 16 tokens). These blocks don't need to be stored contiguously in VRAM; a block table maps each sequence's logical blocks to their physical locations.
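
To make that indirection concrete, here is a minimal sketch of the idea (illustrative only, not vLLM's actual data structures):

# Minimal sketch of a PagedAttention-style block table (illustrative, not vLLM's code).
BLOCK_SIZE = 16  # tokens per physical KV cache block

class BlockTable:
    def __init__(self, free_physical_blocks):
        self.free = free_physical_blocks   # pool of unused physical block ids
        self.mapping = []                  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is grabbed only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.mapping.append(self.free.pop())   # blocks need not be contiguous
        self.num_tokens += 1

    def physical_location(self, token_idx):
        # Translate a logical token position into (physical block id, offset).
        return self.mapping[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

# A 40-token sequence occupies ceil(40 / 16) = 3 scattered blocks, not one big slab.
table = BlockTable(free_physical_blocks=list(range(1000)))
for _ in range(40):
    table.append_token()
print(table.physical_location(35))   # -> (physical block id, offset within block)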

Why PagedAttention is Great

  1. Zero External Fragmentation: All blocks are the same size and allocated on demand, so any free block can serve any sequence.
  2. Flexible Mapping: A single logical sequence can point to physical blocks scattered across the GPU memory.
  3. Basic Sharing: It allows multiple sequences to share the same physical blocks (useful for parallel sampling where one prompt generates N completions).

Where PagedAttention Hits a Wall

The limitation arises when we talk about prefix sharing across different requests. In a production RAG (Retrieval-Augmented Generation) pipeline, you might prepend the same 2,000-token document to 50 different user queries. In standard PagedAttention, unless you manually manage the "common" block IDs and carefully orchestrate your API calls, the system will re-compute and re-store that same 2,000-token prefix for every single request.

This is a massive waste of both compute (prefill latency) and memory. While vLLM has since introduced automatic prefix caching, it is often clunky to tune for dynamic, high-turnover caches.
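
To put numbers on that waste, here is a rough back-of-envelope estimate, assuming Llama-3-8B-style dimensions (32 layers, 8 KV heads, head dim 128, FP16); the exact figures depend on your model and precision:

# Rough, illustrative estimate of the KV cache duplicated for a shared RAG prefix.
# Assumed Llama-3-8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # keys + values
prefix_tokens, requests = 2_000, 50

per_request_mb = prefix_tokens * kv_bytes_per_token / 2**20
total_gb = requests * prefix_tokens * kv_bytes_per_token / 2**30

print(f"{per_request_mb:.0f} MB for one copy of the prefix")      # ~250 MB
print(f"{total_gb:.1f} GB if every request stores its own copy")  # ~12 GB of redundancy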

RadixAttention: The KV Cache as a Searchable Tree

RadixAttention is the core innovation behind the SGLang runtime. Instead of viewing the KV cache as a flat collection of blocks, RadixAttention treats it as a Radix Tree (a compressed prefix tree).

In this architecture, every sequence of tokens is a path in the tree. The nodes in the tree represent blocks of KV cache tensors. When a new request comes in, the engine doesn't just allocate a new block; it performs a prefix search on the Radix Tree to see if any part of that prompt has already been computed and cached from any previous request.

The Mechanics of the Radix Tree

When a request arrives, say: "System: You are a helpful assistant. Context: [Doc A]... Question: What is X?", the Radix manager does the following:

  1. It tokenizes the input.
  2. It matches the longest possible prefix in the current tree (e.g., the system prompt and [Doc A]).
  3. It reuses the existing KV cache for that prefix.
  4. It only computes the "delta" (the unique question).
  5. It inserts the new delta as a new leaf node in the tree.

This makes prefix sharing automatic and transparent. You don’t need to tell the engine that two requests share a prompt; the engine discovers it. This is particularly vital when scaling test-time compute (see Scaling Test-Time Compute: Boosting LLM Reasoning Accuracy), where multiple reasoning paths often share a massive common context.
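
Conceptually, the matching and insertion steps look like the sketch below: a simplified, token-level prefix tree rather than SGLang's actual block-granular, GPU-backed radix tree:

# Simplified, token-level prefix tree. SGLang's real Radix Tree works at
# KV-block granularity and stores GPU tensors; this only shows the matching logic.

class Node:
    def __init__(self):
        self.children = {}      # token_id -> Node
        self.kv_handle = None   # stand-in for a cached KV tensor block

class RadixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, token_ids):
        """Walk the tree and return (matched_length, deepest matched node)."""
        node, matched = self.root, 0
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched, node

    def insert(self, token_ids):
        """Reuse the cached prefix, then append only the uncached suffix (the "delta")."""
        matched, node = self.match_prefix(token_ids)
        for t in token_ids[matched:]:
            child = Node()
            child.kv_handle = f"kv({t})"   # placeholder for a real KV block
            node.children[t] = child
            node = child
        return matched                     # number of tokens whose prefill was skipped

cache = RadixCache()
cache.insert([1, 2, 3, 4, 5])          # first request: nothing cached, computes all 5
print(cache.insert([1, 2, 3, 9]))      # second request: reuses [1, 2, 3] -> prints 3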

Comparing Memory Eviction Policies

In PagedAttention-based engines, when you run out of memory, the scheduler typically falls back to preemption: it evicts whole running requests and recomputes (or swaps back) their KV cache later to make room.

RadixAttention uses an LRU (Least Recently Used) Eviction Policy coupled with a reference counting mechanism.

  • Active Nodes: Nodes currently being used by a running request have a reference count > 0. They cannot be evicted.
  • Cached Nodes: Nodes that finished processing but stay in VRAM. If a new request needs that prefix, it’s instantly "revived."
  • Eviction: When VRAM is full, the system prunes the leaf nodes that haven't been accessed for the longest time.

This turns your GPU memory into a massive, shared prefix cache. If a user is having a multi-turn conversation, their history stays in the Radix Tree. If they pause for 5 minutes and then come back, and no one else has pushed their data out of the cache, the "prefill" cost for their next turn is effectively zero.
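
A toy version of that policy (just a sketch of reference counting plus LRU, not SGLang's implementation) might look like this:

# Sketch of LRU eviction with reference counting over cached tree nodes.
# (Policy illustration only; SGLang's scheduler does this over real KV blocks.)
import time

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}                    # token_id -> Node
        self.ref_count = 0                    # > 0 while a running request holds this prefix
        self.last_access = time.monotonic()

def evictable_leaves(node):
    """Yield leaf nodes that no running request is currently holding."""
    if not node.children:
        if node.ref_count == 0:
            yield node
        return
    for child in node.children.values():
        yield from evictable_leaves(child)

def evict_one(root):
    """Free the least recently used idle leaf so its KV block can be reallocated."""
    candidates = [n for n in evictable_leaves(root) if n.parent is not None]
    if not candidates:
        return None                            # everything is pinned by active requests
    victim = min(candidates, key=lambda n: n.last_access)
    victim.parent.children = {t: c for t, c in victim.parent.children.items() if c is not victim}
    return victim

# Tiny demo: one leaf pinned by an active request, one idle leaf gets evicted.
root = Node()
active, idle = Node(parent=root), Node(parent=root)
root.children[10], root.children[20] = active, idle
active.ref_count = 1
idle.last_access -= 30
print(evict_one(root) is idle)                 # -> True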

Performance Impact: RAG and Multi-Turn Chat

Let's look at the numbers. In a typical RAG scenario (see Optimizing RAG Pipelines: Hybrid Search and Reranking), the context (retrieved chunks) is often much larger than the user query.

| Metric | PagedAttention (Standard) | RadixAttention (SGLang) |
| --- | --- | --- |
| Prefill Latency (2nd turn) | Recomputes everything | Near-zero (cached) |
| Memory Efficiency | High (per request) | Highest (across requests) |
| Throughput (RAG) | 1x | 2x - 5x (depending on cache hits) |
| Implementation Complexity | Simple | High (requires tree management) |

By reusing the KV cache for the retrieved documents, RadixAttention allows you to handle significantly more concurrent users. You aren't just saving memory; you are saving the GPU cycles that would have been spent re-processing the same tokens. This is the same principle I discuss in Fine-Tuning Open-Source LLMs for Domain-Specific RAG: if the data is static, don't pay for it twice.

Implementing Radix-Based Serving

If you want to move to RadixAttention, the most mature implementation is currently SGLang. Below is a conceptual guide on how you’d set this up compared to a standard OpenAI-style endpoint.

Step 1: Launching the SGLang Runtime

Unlike basic backends, SGLang manages the Radix Tree as a background process.

# Launching an SGLang server with Llama-3
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000 \
    --mem-fraction-static 0.8  # fraction of GPU memory reserved for weights + the KV cache pool
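
Once the server is up, you can sanity-check it before wiring in the SGLang frontend. The snippet below assumes your SGLang version serves its OpenAI-compatible API on the same port (recent releases do by default):

# Quick sanity check against the server started above. Assumes the
# OpenAI-compatible route is available at /v1 on port 30000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)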

Step 2: Using the Shared Prefix

In your application code, you don't actually need to do anything special to "trigger" RadixAttention, but you should structure your prompts to maximize prefix overlap.

import sglang as sgl

# Point the frontend at the runtime launched in Step 1
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn_chat(s, system_msg, context, question):
    # This part (system + context) is the shared prefix cached in the Radix Tree.
    # The first request computes it; later requests with the same prefix reuse it.
    s += sgl.system(system_msg)
    s += sgl.user(context)

    # Only this unique "delta" needs a fresh prefill on subsequent requests.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer"))

# Example usage
state = multi_turn_chat.run(
    system_msg="Analyze the following legal doc.",
    context="[... 4000 tokens of text ...]",
    question="What is the termination clause?"
)
print(state["answer"])

If another user asks a different question about the same legal doc, the SGLang runtime recognizes that the system_msg and context tokens match the existing nodes in the Radix Tree and reuses them immediately.
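
For example, a hypothetical follow-up request that reuses the same document only pays prefill for its new question:

# Follow-up request: same system_msg and context, different question.
# The shared prefix hits the Radix Tree, so only the new question is prefilled.
state2 = multi_turn_chat.run(
    system_msg="Analyze the following legal doc.",
    context="[... 4000 tokens of text ...]",
    question="What are the payment terms?",
)
print(state2["answer"])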

Real-World Gotchas and Common Pitfalls

1. The "Small Prompt" Overhead

RadixAttention adds a small amount of CPU overhead to manage the tree structure and perform prefix matching. If your prompts are very short (e.g., < 50 tokens) and never repeat, the overhead of tree management might slightly decrease your throughput compared to a raw PagedAttention implementation. RadixAttention shines when prefixes are > 200 tokens.

2. Cache Poisoning/Pollution

In a multi-tenant environment, a single user sending thousands of unique, massive prompts could potentially flush the cache for everyone else (LRU eviction). You need to implement request-level rate limiting or cache quotas if you're building a public-facing API to ensure one "noisy neighbor" doesn't degrade the TTFT for everyone else.
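
One simple mitigation is a per-tenant budget on cache-filling (uncached) tokens. The sketch below is illustrative; the thresholds, names, and where you enforce them (API gateway, scheduler hook) are up to you:

# Toy per-tenant budget on "cache-filling" (uncached) tokens per time window.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_UNCACHED_TOKENS = 200_000          # per tenant, per window; tune for your VRAM budget

_usage = defaultdict(list)             # tenant_id -> [(timestamp, uncached_tokens), ...]

def admit_request(tenant_id: str, uncached_tokens: int) -> bool:
    """Reject requests that would push a tenant past its cache-churn budget."""
    now = time.monotonic()
    recent = [(t, n) for t, n in _usage[tenant_id] if now - t < WINDOW_SECONDS]
    _usage[tenant_id] = recent
    if sum(n for _, n in recent) + uncached_tokens > MAX_UNCACHED_TOKENS:
        return False                   # shed, queue, or down-prioritize this request
    _usage[tenant_id].append((now, uncached_tokens))
    return True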

3. Tokenizer Consistency

RadixAttention relies on token IDs. If you use different tokenization settings (e.g., adding/removing whitespace at the start of a prompt), the prefix match will fail. Always normalize your prompts before sending them to the inference engine.
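
You can verify this with the model's own tokenizer (assuming you have access to it); in the snippet below, a single leading space changes the first token IDs, so the radix prefix match would miss:

# A single leading space changes the token IDs, so the prefix match would miss.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

a = tok.encode("You are a helpful assistant.", add_special_tokens=False)
b = tok.encode(" You are a helpful assistant.", add_special_tokens=False)

print(a[:3], b[:3])    # the first IDs already differ
print(a == b)          # -> False, so no KV cache reuse for this "identical" prompt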

4. VRAM Fragmentation over Time

While PagedAttention handles block fragmentation, RadixAttention can still suffer from "logical" fragmentation if you have thousands of very small, branchy nodes in your tree. Periodically flushing the cache or capping the number of cached tree nodes is a good production practice.

The Future: Beyond the KV Cache

As we look toward multi-agent systems (see Mastering Multi-Agent Orchestration for AI Workflows), RadixAttention becomes even more critical. Agents often operate in loops, re-reading the same history and scratchpad over and over.

We are also seeing speculative decoding being integrated with radix trees. By knowing what is already in the cache, the engine can make better guesses about the likely next tokens, further reducing latency.

Next Steps

If you are currently using vLLM and your workload involves RAG or long-running conversations, I highly recommend benchmarking SGLang in a staging environment. The transition isn't just about a different library; it's about shifting your mindset from "stateless" requests to a "stateful" caching architecture.

  1. Audit your prompts: Identify the common prefixes. Are they long enough to justify RadixAttention? (Usually, yes if > 512 tokens).
  2. Monitor TTFT: This is where you will see the 10x gains. Don't just look at total throughput.
  3. Optimize your context: Order your prompts so the most static parts (system instructions) come first, followed by the semi-static parts (retrieved docs), with the most dynamic part (the user query) last, as sketched below.
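
A simple way to enforce that ordering is a fixed prompt-assembly template, static parts first (the template and names below are hypothetical):

# Hypothetical prompt assembly: static -> semi-static -> dynamic, to maximize
# the shared prefix that RadixAttention can reuse across requests.
SYSTEM = "You are a contracts analyst. Answer strictly from the provided document."

def build_prompt(retrieved_docs: list[str], user_query: str) -> str:
    context = "\n\n".join(retrieved_docs)       # semi-static: changes per document set
    return (
        f"{SYSTEM}\n\n"                         # static: identical for every request
        f"Context:\n{context}\n\n"              # shared by all queries over the same docs
        f"Question: {user_query}"               # dynamic: unique per request, always last
    )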

Practical FAQ

Q: Does RadixAttention work across multiple GPUs? A: Yes, but it requires a distributed KV cache manager. In current implementations like SGLang, the Radix Tree is managed at the scheduler level. If you are using tensor parallelism, the cache is mirrored/split across the participating GPUs.

Q: Can I use RadixAttention with LoRA adapters? A: This is tricky. If different requests use different LoRA adapters, the KV cache isn't directly sharable because the underlying weights (and thus the keys/values) change. However, you can maintain separate Radix Trees per adapter or use "Multi-LoRA" serving where the prefix is shared but the delta is computed per-adapter.

Q: How does this interact with Context Window limits? A: RadixAttention doesn't increase the model's native context window (e.g., 8k or 128k). It only makes the management of tokens within that window more efficient. If a prefix is longer than the context window, it will still be truncated according to your model's policy.

Q: Is RadixAttention better for "One-Shot" tasks? A: Not necessarily. If every request to your API is completely unique and shares no common prompt or system instructions, PagedAttention is more than sufficient. RadixAttention is a specialized tool for workloads with high redundancy or multi-step reasoning.
