Continuous Batching Isn't Enough: Why Chunked Prefill is the Key to Scaling Low-Latency LLM Inference

I spent three weeks benchmarking vLLM against TGI just to realize that our "latency issue" wasn't about the hardware, but about how we were handling the prefill-decode bottleneck. You don't have to burn three weeks of GPU compute credits to figure this out; the trade-off between chunked prefill and continuous batching is actually quite simple once you see the math. If you're running Llama 3 or Mistral in production and your users are complaining about that awkward 3-second pause before the first word appears, your scheduler is likely the culprit, not your model weights.
TL;DR / Quick Takes
- Continuous Batching solves the "bubble" problem by inserting new requests the moment a token is generated, but it still suffers from Head-of-Line blocking when a massive system prompt enters the queue.
- Chunked Prefill breaks down large input prompts into smaller pieces, allowing the GPU to interleave prefill work with decoding work for existing requests.
- The Trade-off: Continuous batching maximizes throughput (tokens/sec total), while chunked prefill optimizes for TTFT (Time-to-First-Token) and ITL (Inter-Token Latency) stability.
- What to use: If your prompts are consistently under 512 tokens, standard continuous batching is fine. If you’re doing RAG with 10k+ token contexts, you need chunked prefill or your TTFT will spike into double-digit seconds.
The Architecture of the Bottleneck
To understand why we're even talking about this, we have to look at how Large Language Models actually run on a GPU. Inference happens in two distinct phases: Prefill and Decode.
- Prefill: The model processes the entire input prompt at once. This is a compute-bound operation. The GPU loves this because it can saturate its Tensor Cores with large Matrix-Matrix multiplications (GEMM).
- Decode: The model generates one token at a time. This is a memory-bound operation. The GPU hates this because each step is essentially a matrix-vector multiplication (GEMV), so it spends more time moving data from VRAM to the cores than it does actually calculating the next token.
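To put rough numbers on the compute-bound versus memory-bound split, here is a back-of-the-envelope roofline check. It's a sketch, not a benchmark: the peak FLOPS and bandwidth figures are approximate assumptions for an A100-class card, the model is taken to be about 8B parameters in FP16, and KV cache and activation traffic are ignored.

```python
# Rough roofline check: is a forward pass compute-bound or memory-bound?
# All hardware numbers below are approximate assumptions, not measurements.

PARAMS = 8e9               # ~8B parameters (a Llama-3-8B-sized model)
BYTES_PER_PARAM = 2        # FP16/BF16 weights
PEAK_FLOPS = 312e12        # assumed dense FP16 peak of an A100-class GPU
MEM_BANDWIDTH = 2.0e12     # assumed HBM bandwidth in bytes/s


def arithmetic_intensity(tokens_in_batch: int) -> float:
    """FLOPs per byte of weights read, ignoring KV cache and activations."""
    flops = 2 * PARAMS * tokens_in_batch      # ~2 FLOPs per parameter per token
    bytes_moved = PARAMS * BYTES_PER_PARAM    # weights are streamed once per pass
    return flops / bytes_moved


def classify(tokens_in_batch: int) -> str:
    ridge_point = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs/byte where the two limits meet
    if arithmetic_intensity(tokens_in_batch) >= ridge_point:
        return "compute-bound"
    return "memory-bound"


for n in (1, 32, 512, 4000):
    print(n, round(arithmetic_intensity(n), 1), classify(n))
```

Under these assumptions a single-token decode step sits far below the ridge point, while a 4,000-token prefill sits far above it. That gap is the slack a smarter scheduler can exploit.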
In the early days (think early 2023, which feels like a decade ago in AI years), we used static batching. You'd wait for everyone in the batch to finish before starting the next. It was incredibly inefficient. Then came Continuous Batching (popularized by papers like Orca and implementations like vLLM).
Continuous batching allows us to inject a new request into the batch as soon as any previous request finishes a token. It’s like a subway train that lets people hop on and off at every stop rather than waiting at the terminal for the whole train to empty. But there’s a catch (there’s always a catch).
When a new request arrives with a 4,000-token prompt, the GPU has to "prefill" that entire prompt before it can generate the first token. While the GPU is busy doing that massive compute-heavy prefill, the "decode" operations for all the other users currently in the batch have to wait. This causes a massive spike in Inter-Token Latency (ITL). Your existing users see the text stop for two seconds, and your new user waits forever for their first token.
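Before moving on, the static versus continuous batching difference is easy to see in a toy simulation. This is a minimal sketch that models decode iterations only, so it deliberately leaves out the prefill stall just described; the output lengths and batch size are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """A freed slot is refilled at the next iteration boundary."""
    waiting = deque(output_lens)
    running = []          # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lens = [100, 5, 5, 5, 5, 5, 5, 5]   # hypothetical output lengths in tokens
print(static_batching_steps(lens, batch_size=2))
print(continuous_batching_steps(lens, batch_size=2))
```

With these inputs the static scheduler needs 115 iterations and the continuous one needs 100, because the short requests cycle through the second slot while the long one keeps running.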
Enter Chunked Prefill: The Scheduler's Scalpel
Chunked prefill (sometimes referred to as "Sarathi-style" scheduling) changes the rules. Instead of treating a 4,000-token prompt as one giant block of work, we split it into smaller "chunks"—say, 512 tokens each.
By doing this, we can mix a chunk of a new user's prefill with the single-token decodes of existing users in the same GPU iteration. This levels out the compute load. You’re essentially "sneaking in" pieces of the prompt prefill while keeping the token generation stream steady for everyone else.
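Here is a minimal sketch of that interleaving. It assumes a simplified scheduler that places the running decodes first and fills the leftover token budget with a slice of the new prompt; `plan_iterations` and its parameters are hypothetical names for illustration, not an engine API.

```python
def plan_iterations(prompt_len, num_decodes, token_budget=None):
    """Return a list of (prefill_tokens, decode_tokens) per iteration
    until the new prompt is fully prefilled.

    token_budget=None models unchunked prefill: the whole prompt runs in one
    iteration and the decodes wait. Otherwise decodes are scheduled first and
    the leftover budget is filled with a slice of the prompt.
    """
    if token_budget is None:
        return [(prompt_len, 0)]
    if token_budget <= num_decodes:
        raise ValueError("token_budget must exceed the number of running decodes")
    plan, remaining = [], prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decodes)
        plan.append((chunk, num_decodes))
        remaining -= chunk
    return plan


unchunked = plan_iterations(4000, num_decodes=16)
chunked = plan_iterations(4000, num_decodes=16, token_budget=512)

# Number of iterations, and the largest iteration measured in tokens.
print(len(unchunked), max(p + d for p, d in unchunked))
print(len(chunked), max(p + d for p, d in chunked))
```

The unchunked plan is one 4,000-token iteration in which nobody decodes. The chunked plan spreads the same prompt over 9 iterations of at most 512 tokens, and all 16 existing users get a token in every one of them.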
Comparison: Performance Metrics in Production
| Metric | Continuous Batching (Vanilla) | Chunked Prefill |
|---|---|---|
| TTFT (Small Prompts) | Excellent | Excellent |
| TTFT (Large Prompts) | Poor (High Variance) | Good (Consistent) |
| Inter-Token Latency (ITL) | Spiky (Stutters when new users join) | Smooth (Stable) |
| Total Throughput | Very High | Slightly Lower (5-10% overhead) |
| VRAM Management | PagedAttention helps, but still risky | Better utilization of "bubbles" |
Honestly, I think the "slightly lower throughput" of chunked prefill is a total red herring. In production, a 5% drop in total tokens per second is a price I’ll pay every day to avoid a 500% spike in latency that makes the UI feel broken.
Implementing Chunked Prefill (The "How-To")
If you're using vLLM (version 0.4.3 or later), chunked prefill is already built in. You don't have to write custom CUDA kernels to get this working. You just need to tweak your engine arguments.
# Example of how you'd configure a vLLM engine for chunked prefill
from vllm import LLM, SamplingParams

# The magic happens with 'enable_chunked_prefill'.
# max_num_batched_tokens is the per-iteration token budget, which caps the chunk size.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,  # upper bound on the size of each prefill chunk
    max_model_len=8192,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)

# When you submit a long prompt now, it won't stall the whole engine
prompts = ["Explain quantum physics..." * 500]  # A very long prompt
outputs = llm.generate(prompts, sampling_params)
⚠️ Gotcha: If you set max_num_batched_tokens too low (like 128), you'll lose the compute efficiency of the prefill phase. The GPU won't have enough work per chunk to reach its peak TFLOPS. If you set it too high (like 4096), you're basically back to vanilla continuous batching. 512 or 1024 is usually the "Goldilocks" zone for A100/H100 cards.
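To see why both extremes hurt, here is an idealized cost model. Treat it as a sketch under loud assumptions: every constant is an approximate, assumed figure for an 8B FP16 model on an A100-class GPU, attention cost and kernel overheads are ignored, and `estimate` is a hypothetical helper rather than anything vLLM exposes. TTFT is slightly overestimated because the final chunk is usually partial.

```python
import math

# Idealized roofline cost model. Every constant is an assumption for illustration.
WEIGHT_BYTES = 16e9        # ~8B params in FP16
PEAK_FLOPS = 312e12        # assumed dense FP16 peak
MEM_BANDWIDTH = 2.0e12     # assumed bytes/s
FLOPS_PER_TOKEN = 2 * 8e9  # ~2 FLOPs per parameter per token


def iteration_seconds(tokens: int) -> float:
    """An iteration costs at least one full weight read, or its compute time."""
    memory_floor = WEIGHT_BYTES / MEM_BANDWIDTH
    compute_time = tokens * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(memory_floor, compute_time)


def estimate(prompt_len: int, num_decodes: int, token_budget: int):
    """Return (iterations_to_prefill, per_iteration_seconds, ttft_seconds)."""
    chunk = token_budget - num_decodes          # budget left after the decodes
    iterations = math.ceil(prompt_len / chunk)
    tokens_per_iter = min(token_budget, prompt_len + num_decodes)
    step = iteration_seconds(tokens_per_iter)
    return iterations, step, iterations * step


for budget in (128, 512, 1024, 4096):
    iters, step, ttft = estimate(4000, num_decodes=16, token_budget=budget)
    print(f"budget={budget:5d} iters={iters:3d} "
          f"step={step * 1e3:6.1f} ms ttft={ttft * 1e3:6.1f} ms")
```

In this model a budget of 128 never gets above the memory floor, so each iteration pays for a full weight read without using the compute it bought. A budget of 4096 finishes the prefill in one iteration, but that iteration is long enough for every decoding user to notice. The per-iteration time is what existing users feel as ITL.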
Why TTFT is the Only Metric Your Users Care About
We talk a lot about "tokens per second," but if you're building a chat app or an agentic workflow, the user's perception of "speed" is almost entirely tied to TTFT.
Think of it like a restaurant. If you sit down and wait 20 minutes for water, you’re annoyed. If the water arrives in 30 seconds, you’re willing to wait a bit longer for the steak. In LLM terms, getting those first few tokens on the screen immediately buys you "patience equity" from the user.
If you are optimizing MoE models for resource-efficient inference, this becomes even more critical because the routing overhead already adds a layer of latency. Chunked prefill helps hide that.
The Part Nobody Tells You: The Preemption Tax
Here is the "real talk" moment: both of these methods fall apart if you oversubscribe your VRAM.
When you use PagedAttention (the backbone of vLLM and similar engines), you're managing a KV cache in blocks. If the engine gets too many requests and runs out of blocks, it has to perform "preemption." It literally stops a request, then either swaps its KV cache out to CPU RAM or throws it away and recomputes it when the request resumes.
Continuous batching with chunked prefill makes it easier to oversubscribe because the engine feels "smoother," so you're tempted to ramp up the Request Rate. But chunked prefill actually increases the time a request spends "living" in the GPU memory because the prefill phase is stretched out over multiple iterations.
Look, I'll be honest — I've seen teams switch to chunked prefill, see their ITL drop, get cocky, double their traffic, and then wonder why their "Preemption Rate" suddenly skyrocketed. You still need to monitor your KV cache occupancy like a hawk.
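A quick capacity estimate makes the oversubscription risk tangible. This sketch uses the published Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128) with an FP16 cache. The 40 GiB cache budget and the block size of 16 tokens are assumptions for illustration, so plug in your own numbers.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Keys and values for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes, bytes_per_token, block_size=16):
    """Tokens that fit when the cache is allocated in fixed-size blocks."""
    bytes_per_block = bytes_per_token * block_size
    return (kv_budget_bytes // bytes_per_block) * block_size


# Llama-3-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical budget: 40 GiB left for KV cache after weights and activations.
budget = 40 * 1024**3
tokens = max_cached_tokens(budget, per_token)

print(per_token)            # bytes of KV cache per token
print(tokens)               # total tokens the cache can hold
print(tokens // 10_000)     # full 10k-token contexts that fit at once
```

That works out to 128 KiB per token and room for 32 concurrent 10k-token contexts under these assumptions. Admit a 33rd and something gets preempted, no matter how smooth the scheduler feels.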
Advanced Strategies: Speculative Decoding
If you've implemented chunked prefill and you still need more speed, your next move isn't hardware — it's speeding up LLMs with speculative decoding.
By using a smaller "draft" model to predict the next few tokens and then using the larger model (like Llama-3-70B) to verify them in a single forward pass (which looks a lot like a small prefill chunk), you can get a 2x-3x speedup on ITL. When you combine chunked prefill with speculative decoding, you’re running about as optimized an inference stack as was practical in 2024.
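The draft-then-verify loop is easier to trust once you see that it cannot change the output. Below is a toy sketch of greedy speculative decoding. `target_next` and `draft_next` are hypothetical stand-in functions rather than real models, and the verification loop stands in for what a real engine does in one batched forward pass.

```python
def target_next(seq):
    """Stand-in for the large model's greedy next token (hypothetical rule)."""
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq):
    """Stand-in for the draft model: agrees with the target most of the time."""
    guess = target_next(seq)
    return guess if len(seq) % 4 else (guess + 1) % 50


def speculative_decode(prompt, num_new_tokens, k=4):
    seq, target_passes = list(prompt), 0
    while len(seq) - len(prompt) < num_new_tokens:
        # 1) Draft proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) Target checks all k positions; a real engine does this in one
        #    batched forward pass, so it is counted as a single pass here.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(seq + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)     # replace first mismatch, then stop
                break
            accepted.append(draft[i])
        else:
            accepted.append(target_next(seq + draft))  # bonus token when all match
        seq.extend(accepted)
    return seq[:len(prompt) + num_new_tokens], target_passes


def greedy_decode(prompt, num_new_tokens):
    seq = list(prompt)
    for _ in range(num_new_tokens):
        seq.append(target_next(seq))
    return seq


prompt = [3, 7, 11]
spec_out, passes = speculative_decode(prompt, num_new_tokens=20)
assert spec_out == greedy_decode(prompt, 20)   # same tokens as target-only greedy
print(passes, "target passes instead of 20")
```

The assertion is the important part: every accepted token is one the target model would have picked anyway, so the only thing that changes is how many expensive passes it takes to get there. How big the win is in practice depends on how often your draft model agrees with the target.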
Comparison Table: When to Use What?
| Use Case | Best Strategy | Why? |
|---|---|---|
| Internal Batch Processing | Vanilla Continuous Batching | Maximize throughput; nobody cares about latency. |
| Customer Chatbot (Short Prompts) | Continuous Batching | Low overhead; TTFT is naturally low. |
| RAG / Document Analysis | Chunked Prefill | 10k+ token prompts will otherwise freeze the engine. |
| Coding Assistants | Chunked Prefill | Large file contexts need stable streaming speed. |
| Edge AI / Small Models | Simple Batching | Overhead of complex schedulers often isn't worth it. |
Practical FAQ
Q: Does chunked prefill work with quantization (like AWQ or FP8)? Absolutely. In fact, it’s almost mandatory. When you use FP8, your compute is even faster, meaning the "Decode" phase becomes even more of a relative bottleneck compared to the "Prefill" phase. Chunked prefill helps balance that.
Q: What happens if my chunk size is larger than the prompt? Then it just behaves like regular continuous batching. The scheduler sees that the prompt fits in one chunk and processes it. There's no penalty other than a few microseconds of scheduling logic.
Q: Is there any hardware that doesn't benefit from this? If you're running on hardware with very low compute-to-memory-bandwidth ratios (like some older consumer GPUs or certain edge NPUs), the gap between compute-bound prefill and memory-bound decode is smaller, so there is less idle compute for the chunks to soak up. But for A100s, H100s, and L40s, chunking is a massive win.
Q: How does this interact with Prefix Caching? This is where it gets interesting. If you have a common system prompt (a "prefix") that is cached, chunked prefill only applies to the new part of the prompt. This is the ultimate "low-latency" setup: cache the static parts, chunk the dynamic user input.
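For the prefix caching question, a small sketch shows how much prefill work is left once the shared prefix is served from cache. It assumes reuse happens only in whole KV blocks of 16 tokens. `prefill_plan` is a hypothetical helper and the request sizes are made up.

```python
import math


def prefill_plan(prompt_len, cached_prefix_len, chunk_size, block_size=16):
    """Return (reused_tokens, tokens_to_compute, num_chunks)."""
    # Assume only whole KV blocks can be reused from the cache.
    reused = (min(cached_prefix_len, prompt_len) // block_size) * block_size
    to_compute = prompt_len - reused
    return reused, to_compute, math.ceil(to_compute / chunk_size)


# Hypothetical request: 3,000-token shared system prompt + 1,200 tokens of user input.
print(prefill_plan(4200, cached_prefix_len=0, chunk_size=512))
print(prefill_plan(4200, cached_prefix_len=3000, chunk_size=512))
```

Without the cache the prompt needs 9 chunks of 512. With the prefix cached, only 1,208 tokens are left to compute, which fits in 3 chunks.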
What to Try Next
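If you’re running a production service, don't just take my word for it. Set up a Prometheus dashboard and track vllm:time_to_first_token_seconds alongside vllm:num_requests_waiting and vllm:num_preemptions_total. Metric names shift between vLLM versions, so check them against your server's /metrics endpoint.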
Turn on chunked prefill with a max_num_batched_tokens of 512. Watch your tail latency (P99 TTFT). If it drops without your throughput cratering, you’ve found the sweet spot.
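If you're pulling raw TTFT samples out of your logs instead of a dashboard, a nearest-rank percentile is enough to see the tail. This is a minimal sketch with made-up sample data; `percentile` is a plain helper, not part of any monitoring library.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFT samples in seconds: mostly fast, with a few long-prompt stalls.
ttft = [0.2] * 95 + [0.4] * 3 + [3.1, 4.8]

print(percentile(ttft, 50))
print(percentile(ttft, 99))
```

The median here is 0.2 seconds and looks perfectly healthy, while the P99 is 3.1 seconds. That is why the tail is the number to compare before and after you flip the flag.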
In my experience, the move from "standard" serving to "chunked" serving is the difference between a product that feels like a toy and a product that feels like a professional tool. Don't let your GPU's raw power be throttled by a naive scheduler.
Gulshan Sharma
AI/ML Engineer, Full-Stack Developer
AI engineer and technical writer passionate about making artificial intelligence accessible. Building tools and sharing knowledge at the intersection of ML engineering and practical software development.