Taming Outliers: A Comparative Deep-Dive into SVD-Quant and QuaRot for Production W8A8 Inference

Quick Summary
If you are moving Large Language Models (LLMs) into production, you’ve likely hit the W8A8 quantization wall. While weight-only quantization (W8A16) is relatively straightforward, quantizing activations to 8-bit (INT8 or FP8) usually wrecks perplexity, thanks to high-magnitude outliers in the activation tensors. This article compares two state-of-the-art strategies to solve this: QuaRot, which uses randomized rotations (Hadamard transforms) to "smear" outliers across dimensions, and SVD-Quant, which uses Singular Value Decomposition to peel off outliers into a high-precision low-rank branch.
- Choose QuaRot if you want a mathematically elegant way to eliminate outliers entirely without changing the model architecture, provided you can handle the overhead of online Hadamard transforms and custom kernels.
- Choose SVD-Quant if you prefer a "divide and conquer" approach that keeps the majority of the workload in standard INT8/FP8 while handling problematic outliers in a separate FP16 branch, offering better accuracy for models with extreme activation spikes.
The Outlier Problem: Why W8A8 is Hard
You’ve probably seen the benchmarks: FP16 inference is slow and memory-intensive, but INT8 weight-only quantization only solves the memory bandwidth bottleneck for the weights—it doesn’t help with compute-bound scenarios or KV cache scaling. To truly maximize throughput on NVIDIA A100s or H100s, you need Weight-Activation Quantization (W8A8).
The problem is the activation outliers. In models like Llama-3 or Mistral, certain channels in the activation tensors exhibit magnitudes 10x to 100x larger than the mean. If you use a single per-tensor (or even per-token) scaling factor, these outliers force a huge dynamic range, crushing the remaining 99.9% of the values into a handful of quantization levels.
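To see why one spike is so destructive, here is a minimal sketch of symmetric per-tensor INT8 quantization (illustrative numbers only, not measured from any specific model):

```python
import torch

def int8_roundtrip_error(x):
    # Symmetric per-tensor quantization: one scale for the entire tensor
    scale = x.abs().max() / 127.0
    x_q = (x / scale).round().clamp(-128, 127) * scale
    return (x - x_q).abs().mean().item()

activations = torch.randn(4096)           # "normal" values, roughly in [-4, 4]
print(int8_roundtrip_error(activations))   # small mean error

activations[0] = 100.0                     # a single 100x outlier channel
print(int8_roundtrip_error(activations))   # error grows ~25x: the scale is now
                                           # set by the spike, so everything else
                                           # shares only a few quantization levels
```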
I’ve seen engineers try to "clip" these outliers, but that almost always results in a catastrophic drop in the model's reasoning capabilities. This is where QuaRot and SVD-Quant come in. They don't just ignore the outliers; they architecturally account for them.
QuaRot: Computational Sleight of Hand via Rotations
QuaRot (Quantization with Rotation) operates on a fascinating mathematical premise: outliers are often "axis-aligned." If you rotate the feature space using an orthogonal matrix, you can distribute the energy of those outliers across all dimensions.
How QuaRot Works
QuaRot applies a Walsh-Hadamard Transform (WHT) to the hidden states. Because the WHT is an orthogonal transformation, it preserves the L2 norm (and thus the informational content) while effectively "smearing" the spikes across all channels (see the sketch after the list below).
- Weight Rotation: During post-training quantization (PTQ), you rotate the weights offline.
- Activation Rotation: During inference, you rotate the activations online before the linear layers.
- Hadamard Fusion: To avoid the $O(N^2)$ cost of a full rotation matrix, QuaRot uses the Fast Hadamard Transform (FHT), which is $O(N \log N)$. This is typically fused into the preceding LayerNorm or RoPE (Rotary Positional Embedding) kernels.
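As a toy illustration of this "smearing" (a dense $O(N^2)$ matmul with a hand-built Hadamard matrix, not the fused $O(N \log N)$ FHT kernels QuaRot actually ships):

```python
import torch

def hadamard(n):
    # Build an orthonormal Walsh-Hadamard matrix recursively (n must be a power of 2)
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)

n = 4096
x = torch.randn(n)
x[7] = 100.0                  # one axis-aligned outlier channel

H = hadamard(n)
x_rot = H @ x                 # rotate the feature space

print(x.abs().max(), x_rot.abs().max())   # the spike collapses to ordinary magnitude
print(x.norm(), x_rot.norm())             # the L2 norm is preserved
```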
This approach is particularly effective for Optimizing MoE Models for Efficient Resource Inference, where gating logic can be extremely sensitive to activation precision.
The Trade-offs of QuaRot
The biggest "Gotcha" with QuaRot is the kernel overhead. While $O(N \log N)$ is better than $O(N^2)$, the constants matter. If your fused kernels aren't highly optimized for your specific GPU architecture, the time saved by using INT8 GEMMs can be eaten up by the WHT operations. Furthermore, QuaRot requires rotating the KV cache, which adds complexity to the attention mechanism.
SVD-Quant: The "Divide and Conquer" Strategy
SVD-Quant takes a more direct approach. Instead of trying to hide the outliers, it identifies them using Singular Value Decomposition and treats them differently.
The Mechanics of SVD-Quant
SVD-Quant decomposes a weight matrix $W$ into two parts: a quantized base and a high-precision outlier branch. However, it does something clever with the activations. It observes that outliers in activations are often consistent across different inputs.
- Smoothing: It scales the weights to migrate the "difficulty" of the activations into the weights, similar to SmoothQuant (a minimal sketch follows this list).
- SVD Decomposition: It identifies the top-k singular values/vectors that contribute most to the outlier behavior.
- Dual-Path Execution:
- Main Path: Most of the matrix multiplication happens in INT8/FP8.
- Outlier Path: A tiny fraction of the computation (the low-rank part representing the outliers) is performed in FP16.
- Summation: The results are fused at the end.
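The smoothing step can be sketched with SmoothQuant-style per-channel scales. The $\alpha$ formula below is SmoothQuant's migration-strength heuristic, shown here as an assumption about how SVD-Quant's smoothing stage behaves; the exact recipe differs per implementation:

```python
import torch

def smooth_scales(act_max, weight, alpha=0.5):
    # Per-input-channel scales: X' = X / s and W' = W * s,
    # so X' @ W'.T == X @ W.T exactly, but X' has no extreme channels
    w_max = weight.abs().amax(dim=0)               # per-column max, [in_features]
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    return s.clamp(min=1e-5)

# act_max comes from calibration data; mocked here with one spiky channel
act_max = torch.rand(4096) * 4
act_max[42] = 400.0
W = torch.randn(4096, 4096)

s = smooth_scales(act_max, W)
W_smoothed = W * s   # folded into the weights offline; at runtime the activations
                     # are divided by s (usually fused into the preceding LayerNorm)
```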
This is highly effective when Fine-Tuning Open-Source LLMs for Domain-Specific RAG, as the domain-specific data often introduces new, unpredictable activation patterns that standard PTQ might miss.
Why SVD-Quant Wins on Accuracy
Because SVD-Quant doesn't force the outliers into an 8-bit representation at all, it can maintain near-FP16 perplexity even at very aggressive quantization levels. You aren't "smearing" the error; you are removing the source of the error from the quantized calculation entirely.
Head-to-Head: Comparative Analysis
| Feature | QuaRot | SVD-Quant |
|---|---|---|
| Mathematical Basis | Orthogonal Transformations (Hadamard) | Low-rank Decomposition (SVD) |
| Hardware Affinity | High (utilizes standard INT8 Tensor Cores) | Moderate (requires dual-path execution) |
| Implementation Complexity | High (requires custom fused kernels) | Moderate (Standard SVD + Branching) |
| KV Cache Impact | Rotates KV cache (complex) | Standard quantization (simpler) |
| Accuracy Retention | Excellent | State-of-the-art |
| Best For | Massive-scale throughput on H100s | Precision-critical edge deployments |
Implementation Guide: Integrating SVD-Quant Logic
Implementing a full SVD-Quant kernel is beyond a single blog post, but we can look at the pre-quantization decomposition step, which is the heart of the process. If you are building a custom inference engine, this is how you would prepare your weights.
```python
import torch

def svd_quant_decompose(weight_matrix, ratio=0.01):
    """
    Decompose a weight matrix into a quantizable base and a
    rank-k outlier component that stays in high precision.
    """
    # 1. Perform SVD in FP32 for numerical stability
    # weight_matrix shape: [out_features, in_features]
    U, S, Vh = torch.linalg.svd(weight_matrix.to(torch.float32), full_matrices=False)

    # 2. Determine rank k for the outlier branch
    k = max(1, int(weight_matrix.shape[1] * ratio))

    # 3. Extract the low-rank 'outlier' component (kept in FP16 at runtime)
    U_k = U[:, :k]
    S_k = S[:k]
    Vh_k = Vh[:k, :]
    weight_outlier = U_k @ torch.diag(S_k) @ Vh_k

    # 4. The 'residual' weight is what we quantize to INT8/FP8
    weight_quantizable = weight_matrix - weight_outlier
    return weight_quantizable, (U_k, S_k, Vh_k)

# Example usage in a mock linear layer
W = torch.randn(4096, 4096)
W_base, W_outliers = svd_quant_decompose(W, ratio=0.01)
# In production, W_base is converted to INT8/FP8, while W_outliers is stored
# as two smaller FP16 matrices (U_k scaled by S_k, and Vh_k)
```
During inference, you would compute: $Y = \text{QuantizedLinear}(X, W_{\text{base}}) + (X \, Vh_k^{\top})(U_k \, \text{diag}(S_k))^{\top}$
This looks like extra work, but because $k$ is very small (often $<1\%$ of the hidden dimension), the second term is incredibly fast compared to the main GEMM.
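Putting it together, here is a hedged sketch of the dual-path forward pass, reusing `svd_quant_decompose` from above. The `fake_quant_int8` helper is a stand-in for a real INT8 GEMM kernel, used here only to simulate quantization error:

```python
import torch

def fake_quant_int8(w):
    # Simulate symmetric per-channel INT8 quantization of the base weights
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    return (w / scale).round().clamp(-128, 127) * scale

def svd_quant_forward(x, W_base, U_k, S_k, Vh_k):
    # Main path: INT8 GEMM on the residual weights (simulated here)
    y_main = x @ fake_quant_int8(W_base).T
    # Outlier path: two skinny matmuls through the rank-k branch, kept in high precision
    y_outlier = (x @ Vh_k.T) @ (U_k * S_k).T   # U_k * S_k == U_k @ diag(S_k)
    return y_main + y_outlier

# Sanity check against the original unquantized layer
W = torch.randn(4096, 4096)
W_base, (U_k, S_k, Vh_k) = svd_quant_decompose(W, ratio=0.01)
x = torch.randn(8, 4096)
y = svd_quant_forward(x, W_base, U_k, S_k, Vh_k)
print((y - x @ W.T).abs().max())   # only the base path carries quantization error
```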
Real-World "Gotchas" and Common Pitfalls
1. The "Fused Kernel" Trap
Both methods look great on paper, but if you implement them in pure PyTorch, they will be slower than FP16. Why? Because the overhead of launching multiple kernels for the outlier branch or the Hadamard transform kills your gains. To make these work in production, you must use Triton or CUDA to fuse the outlier addition or the rotation directly into the main GEMM or the activation function.
2. Numerical Stability in KV Cache
QuaRot rotates the KV cache. If you are also using speculative decoding (see Speeding Up LLMs: A Guide to Speculative Decoding), the mismatch between the drafted tokens and the target model's rotated feature space can lead to divergence if the rotation isn't handled with high precision. Ensure your WHT is performed in FP32 or a very stable FP16 implementation.
3. Hardware Support
INT8 Tensor Cores (available since the Turing architecture) behave differently than the newer FP8 units on H100s. QuaRot's rotation strategy is specifically beneficial for INT8 because INT8 has a very narrow dynamic range. If you are on an H100 using FP8, the need for QuaRot is slightly diminished because FP8’s E4M3/E5M2 formats handle outliers slightly better than INT8—but SVD-Quant still provides a significant boost for 4-bit experiments.
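To make the dynamic-range point concrete, here is a toy comparison (it assumes a recent PyTorch build that exposes the `torch.float8_e4m3fn` dtype; the cast is purely illustrative and does not touch the H100's FP8 tensor cores):

```python
import torch

x = torch.tensor([0.01, 0.5, 3.0, 100.0])

# INT8 with one symmetric scale: the 100.0 outlier sets scale = 100/127,
# so 0.01 rounds to zero and 0.5 lands on the nearest ~0.79 step
scale = x.abs().max() / 127.0
print((x / scale).round().clamp(-128, 127) * scale)

# FP8 E4M3 keeps a few significant bits at every magnitude up to ~448,
# so the small and large values survive together without any rescaling
print(x.to(torch.float8_e4m3fn).to(torch.float32))
```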
Evaluation: When to Use Which?
If you are optimizing a model for a high-concurrency environment where every millisecond of latency counts (like a real-time chat API), QuaRot is generally the winner once the kernels are tuned. The ability to keep the inference path as a single "stream" of rotations and GEMMs is cleaner for hardware schedulers.
However, if you are dealing with a model that has "heavy hitters" (activations that are massively larger than others), SVD-Quant is safer. I’ve seen cases where QuaRot's "smearing" isn't enough to prevent 8-bit saturation, whereas SVD-Quant's dedicated FP16 path guarantees those critical signals are preserved. This is vital for complex reasoning tasks where a single bit-flip in a high-magnitude activation can change a "Yes" to a "No."
Next Steps for Production Implementation
- Profile your activations: Before choosing, run your model on a representative dataset and plot the distribution of activation magnitudes (a hook-based sketch follows this list). If you see a few "spikes" that are 50x the mean, go with SVD-Quant. If the distribution is generally wide but without extreme spikes, QuaRot is your friend.
- Check your kernel library: If you are using vLLM or TensorRT-LLM, check their current support. Many of these libraries are beginning to integrate QuaRot-style rotations as a standard preprocessing step.
- Test at Scale: Quantization errors often hide in short benchmarks. Test your W8A8 model on long-context tasks (32k+ tokens) to ensure the accumulated error in the KV cache doesn't degrade performance over time.
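A minimal way to do that profiling with PyTorch forward hooks (the stand-in model and the 50x threshold below are illustrative; adapt the filter and batches to your real model):

```python
import torch
import torch.nn as nn

# Stand-in model; swap in your real LLM (e.g., loaded via transformers)
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach().float()
        # Outlier score: worst-case max-to-mean magnitude ratio of the layer input
        ratio = (x.abs().amax() / x.abs().mean().clamp(min=1e-8)).item()
        stats[name] = max(stats.get(name, 0.0), ratio)
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(4, 4096))   # replace with representative calibration batches

print(stats)   # ratios of 50x+ point toward SVD-Quant; a wide but
               # spike-free distribution suggests QuaRot will suffice
```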
Practical FAQ
Q: Can I combine SVD-Quant and QuaRot? A: Technically, yes, but the complexity-to-performance gain ratio is poor. Rotating the space already tries to solve what SVD-Quant solves via decomposition. Doing both adds unnecessary computational overhead.
Q: Does W8A8 quantization affect the speed of the Prefill or Decode phase more? A: W8A8 significantly improves the Prefill phase (which is compute-bound) by utilizing Tensor Cores more efficiently. For the Decode phase, it helps by reducing the KV cache memory bandwidth pressure, allowing for larger batch sizes.
Q: Is there a significant difference in power consumption? A: Yes. Moving from FP16 to INT8/FP8 compute on modern GPUs can reduce the energy-per-token significantly, as integer operations and low-precision floating point require less toggling of logic gates in the ALU.
Q: What about Llama-3 specifically? A: Llama-3 is known to have more "difficult" activation distributions than Llama-2. Preliminary results from the community suggest that SVD-Quant or SmoothQuant+ (a precursor to the logic in SVD-Quant) is almost mandatory for Llama-3 70B if you want to maintain 8-bit accuracy without fine-tuning.