
Taming Outliers: A Comparative Deep-Dive into SVD-Quant and QuaRot for Production W8A8 Inference

Gulshan Sharma
Published on May 4, 2026


Quick Summary

If you are moving Large Language Models (LLMs) into production, you’ve likely hit the W8A8 quantization wall. While weight-only quantization (W8A16) is relatively straightforward, quantizing activations to 8-bit (INT8 or FP8) usually blows up model perplexity due to high-magnitude outliers in the activation tensors. This article compares two state-of-the-art strategies to solve this: QuaRot, which uses randomized rotations (Hadamard transforms) to "smear" outliers across dimensions, and SVD-Quant, which uses Singular Value Decomposition to peel off outliers into a high-precision low-rank branch.

  • Choose QuaRot if you want a mathematically elegant way to eliminate outliers entirely without changing the model architecture, provided you can handle the overhead of online Hadamard transforms and custom kernels.
  • Choose SVD-Quant if you prefer a "divide and conquer" approach that keeps the majority of the workload in standard INT8/FP8 while handling problematic outliers in a separate FP16 branch, offering better accuracy for models with extreme activation spikes.

The Outlier Problem: Why W8A8 is Hard

You’ve probably seen the benchmarks: FP16 inference is slow and memory-intensive, but INT8 weight-only quantization only solves the memory bandwidth bottleneck for the weights—it doesn’t help with compute-bound scenarios or KV cache scaling. To truly maximize throughput on NVIDIA A100s or H100s, you need Weight-Activation Quantization (W8A8).

The problem is the activation outliers. In models like Llama-3 or Mistral, certain channels in the activation tensors exhibit magnitudes 10x to 100x larger than the mean. If you use a single per-tensor (or even per-token) scaling factor, these outliers force a large dynamic range, crushing the precision of the remaining 99.9% of the values into a handful of bit levels.
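To make this concrete, here is a minimal, self-contained sketch (plain PyTorch on synthetic data, not a production quantizer) showing how a single 100x outlier inflates the per-tensor scale and degrades every other value:

```python
import torch

def quantize_dequantize_int8(x: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor INT8: one scale chosen from the max magnitude
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127)
    return q * scale

torch.manual_seed(0)
acts = torch.randn(4096)            # well-behaved activations, magnitude ~1
spiked = acts.clone()
spiked[0] = 100.0                   # one 100x outlier channel

err_clean = (quantize_dequantize_int8(acts) - acts).abs().mean()
# Measure error only on the 4095 "normal" values; the outlier itself is fine
err_spiked = (quantize_dequantize_int8(spiked) - spiked)[1:].abs().mean()

print(f"mean abs error, no outlier:   {err_clean:.4f}")
print(f"mean abs error, with outlier: {err_spiked:.4f}")
```

A single spike pushes the scale up by more than an order of magnitude, and the rounding error on all the other values grows by the same factor.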

I’ve seen engineers try to "clip" these outliers, but that almost always results in a catastrophic drop in the model's reasoning capabilities. This is where QuaRot and SVD-Quant come in. They don't just ignore the outliers; they architecturally account for them.

QuaRot: Computational Sleight of Hand via Rotations

QuaRot (Quantization with Rotation) operates on a fascinating mathematical premise: outliers are often "axis-aligned." If you rotate the feature space using an orthogonal matrix, you can distribute the energy of those outliers across all dimensions.

How QuaRot Works

QuaRot applies a Walsh-Hadamard Transform (WHT) to the hidden states. Because the WHT is an orthogonal transformation, it preserves the L2 norm (and thus the informational content) while effectively "smearing" the spikes.

  1. Weight Rotation: During post-training quantization (PTQ), you rotate the weights offline.
  2. Activation Rotation: During inference, you rotate the activations online before the linear layers.
  3. Hadamard Fusion: To avoid the $O(N^2)$ cost of a full rotation matrix, QuaRot uses the Fast Hadamard Transform (FHT), which is $O(N \log N)$. This is typically fused into the preceding LayerNorm or RoPE (Rotary Positional Embedding) kernels.
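The rotation idea in the steps above can be illustrated in a few lines. Below is a sketch of the normalized Fast Walsh-Hadamard Transform in plain PyTorch (a demo, not the fused production kernel), showing that the transform preserves the L2 norm while flattening an axis-aligned spike:

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Normalized Fast Walsh-Hadamard Transform along the last dim.
    O(n log n); the length must be a power of two."""
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    shape = x.shape
    h = 1
    while h < n:
        x = x.reshape(-1, n // (2 * h), 2, h)
        a, b = x[..., 0, :], x[..., 1, :]
        x = torch.stack((a + b, a - b), dim=-2)   # butterfly step
        h *= 2
    return x.reshape(shape) / n ** 0.5            # 1/sqrt(n) makes it orthogonal

torch.manual_seed(0)
x = torch.randn(1, 4096)
x[0, 0] = 100.0                     # axis-aligned outlier

y = fwht(x)
print(x.norm().item(), y.norm().item())            # norms match (orthogonal)
print(x.abs().max().item(), y.abs().max().item())  # spike is smeared out
```

After the transform, the 100x spike's energy is spread across all 4096 dimensions, so the post-rotation tensor quantizes cleanly with a single scale.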

This approach is particularly effective when optimizing Mixture-of-Experts (MoE) models for efficient inference, where gating logic can be extremely sensitive to activation precision.

The Trade-offs of QuaRot

The biggest "Gotcha" with QuaRot is the kernel overhead. While $O(N \log N)$ is better than $O(N^2)$, the constants matter. If your fused kernels aren't highly optimized for your specific GPU architecture, the time saved by using INT8 GEMMs can be eaten up by the WHT operations. Furthermore, QuaRot requires rotating the KV cache, which adds complexity to the attention mechanism.

SVD-Quant: The "Divide and Conquer" Strategy

SVD-Quant takes a more direct approach. Instead of trying to hide the outliers, it identifies them using Singular Value Decomposition and treats them differently.

The Mechanics of SVD-Quant

SVD-Quant decomposes a weight matrix $W$ into two parts: a quantized base and a high-precision outlier branch. However, it does something clever with the activations. It observes that outliers in activations are often consistent across different inputs.

  1. Smoothing: It scales the weights to migrate the "difficulty" of activations into the weights (similar to SmoothQuant).
  2. SVD Decomposition: It identifies the top-k singular values/vectors that contribute most to the outlier behavior.
  3. Dual-Path Execution:
    • Main Path: Most of the matrix multiplication happens in INT8/FP8.
    • Outlier Path: A tiny fraction of the computation (the low-rank part representing the outliers) is performed in FP16.
    • Summation: The results are fused at the end.

This is highly effective when fine-tuning open-source LLMs for domain-specific RAG, as the domain-specific data often introduces new, unpredictable activation patterns that standard PTQ might miss.

Why SVD-Quant Wins on Accuracy

Because SVD-Quant doesn't force the outliers into an 8-bit representation at all, it can maintain near-FP16 perplexity even at very aggressive quantization levels. You aren't "smearing" the error; you are removing the source of the error from the quantized calculation entirely.

Head-to-Head: Comparative Analysis

| Feature | QuaRot | SVD-Quant |
| --- | --- | --- |
| Mathematical basis | Orthogonal transformations (Hadamard) | Low-rank decomposition (SVD) |
| Hardware affinity | High (utilizes standard INT8 Tensor Cores) | Moderate (requires dual-path execution) |
| Implementation complexity | High (requires custom fused kernels) | Moderate (standard SVD + branching) |
| KV cache impact | Rotates KV cache (complex) | Standard quantization (simpler) |
| Accuracy retention | Excellent | State-of-the-art |
| Best for | Massive-scale throughput on H100s | Precision-critical edge deployments |

Implementation Guide: Integrating SVD-Quant Logic

Implementing a full SVD-Quant kernel is beyond a single blog post, but we can look at the pre-quantization decomposition step, which is the heart of the process. If you are building a custom inference engine, this is how you would prepare your weights.

import torch

def svd_quant_decompose(weight_matrix, ratio=0.01):
    """
    Decomposes a weight matrix into a quantized base and an outlier rank-k component.
    """
    # 1. Perform SVD
    # weight_matrix shape: [out_features, in_features]
    U, S, Vh = torch.linalg.svd(weight_matrix.to(torch.float32), full_matrices=False)
    
    # 2. Determine rank k for outliers
    k = int(weight_matrix.shape[1] * ratio)
    
    # 3. Extract the low-rank 'outlier' component (FP16)
    U_k = U[:, :k]
    S_k = S[:k]
    Vh_k = Vh[:k, :]
    
    weight_outlier = U_k @ torch.diag(S_k) @ Vh_k
    
    # 4. The 'residual' weight is what we quantize to INT8/FP8
    weight_quantizable = weight_matrix - weight_outlier
    
    return weight_quantizable, (U_k, S_k, Vh_k)

# Example usage in a mock linear layer
W = torch.randn(4096, 4096)
W_base, W_outliers = svd_quant_decompose(W, ratio=0.01)

# In production, W_base is converted to INT8/FP8
# W_outliers is stored as two smaller FP16 matrices (U_k * S_k and Vh_k)

During inference, you would compute $Y = \text{QuantizedLinear}(X, W_{\text{base}}) + (X\,Vh_k^{\top})\,(U_k \operatorname{diag}(S_k))^{\top}$, where the second term is the low-rank FP16 outlier branch.

This looks like extra work, but because $k$ is very small (often under 1% of the hidden dimension), the second term is incredibly fast compared to the main GEMM.
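A quick sanity check ties the inference formula back to the decomposition: before any actual quantization is applied, the base-path output plus the low-rank branch should reconstruct the original FP32 result exactly. This sketch inlines the SVD step from the earlier snippet; the small 256-dim sizes are just for the demo:

```python
import torch

torch.manual_seed(0)
W = torch.randn(256, 256)   # weight: [out_features, in_features]
X = torch.randn(8, 256)     # a batch of activations

# Rank-k outlier branch, as in svd_quant_decompose
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
k = int(W.shape[1] * 0.05)
U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]
W_outlier = U_k @ torch.diag(S_k) @ Vh_k
W_base = W - W_outlier

# Main path (INT8/FP8 in production) + low-rank outlier path in FP16
Y_main = X @ W_base.T
Y_outlier = (X @ Vh_k.T) @ (U_k * S_k).T   # (U_k * S_k) == U_k @ diag(S_k)
Y = Y_main + Y_outlier

print(torch.allclose(Y, X @ W.T, atol=1e-2))  # True: the split is lossless
```

The accuracy loss in real SVD-Quant comes only from quantizing $W_{\text{base}}$; the decomposition itself is exact, which is why the outliers never see an 8-bit grid.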

Real-World "Gotchas" and Common Pitfalls

1. The "Fused Kernel" Trap

Both methods look great on paper, but if you implement them in pure PyTorch, they will be slower than FP16. Why? Because the overhead of launching multiple kernels for the outlier branch or the Hadamard transform kills your gains. To make these work in production, you must use Triton or CUDA to fuse the outlier addition or the rotation directly into the main GEMM or the activation function.

2. Numerical Stability in KV Cache

QuaRot rotates the KV cache. If you are also using speculative decoding to speed up generation, the mismatch between the drafted tokens and the target model's rotated feature space can lead to divergence if the rotation isn't handled with high precision. Ensure your WHT is performed in FP32 or a very stable FP16 implementation.

3. Hardware Support

INT8 Tensor Cores (available since the Turing architecture) behave differently than the newer FP8 units on H100s. QuaRot's rotation strategy is specifically beneficial for INT8 because INT8 has a very narrow dynamic range. If you are on an H100 using FP8, the need for QuaRot is slightly diminished because FP8’s E4M3/E5M2 formats handle outliers slightly better than INT8—but SVD-Quant still provides a significant boost for 4-bit experiments.

Evaluation: When to Use Which?

If you are optimizing a model for a high-concurrency environment where every millisecond of latency counts (like a real-time chat API), QuaRot is generally the winner once the kernels are tuned. The ability to keep the inference path as a single "stream" of rotations and GEMMs is cleaner for hardware schedulers.

However, if you are dealing with a model that has "heavy hitters" (activations that are massively larger than others), SVD-Quant is safer. I’ve seen cases where QuaRot's "smearing" isn't enough to prevent 8-bit saturation, whereas SVD-Quant's dedicated FP16 path guarantees those critical signals are preserved. This is vital for complex reasoning tasks where a single bit-flip in a high-magnitude activation can change a "Yes" to a "No."

Next Steps for Production Implementation

  1. Profile your activations: Before choosing, run your model with a representative dataset and plot the distribution of activation magnitudes. If you see a few "spikes" that are 50x the mean, go with SVD-Quant. If the distribution is generally wide but without extreme spikes, QuaRot is your friend.
  2. Check your kernel library: If you are using vLLM or TensorRT-LLM, check their current support. Many of these libraries are beginning to integrate QuaRot-style rotations as a standard preprocessing step.
  3. Test at Scale: Quantization errors often hide in short benchmarks. Test your W8A8 model on long-context tasks (32k+ tokens) to ensure the accumulated error in the KV cache doesn't degrade performance over time.
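For step 1, a minimal profiling harness can be built with PyTorch forward hooks. This is a sketch: the toy model, the Linear-only filter, and the ~50x spike-ratio threshold are placeholders you would adapt to your architecture and calibration set.

```python
import torch
import torch.nn as nn

def attach_outlier_probes(model: nn.Module, stats: dict) -> list:
    """Record (max, mean) absolute input magnitude for every Linear layer."""
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            def hook(mod, inputs, output, name=name):
                x = inputs[0].detach().float()
                stats.setdefault(name, []).append(
                    (x.abs().max().item(), x.abs().mean().item())
                )
            handles.append(module.register_forward_hook(hook))
    return handles

# Demo on a toy stack; in practice, run a representative calibration set
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
stats = {}
handles = attach_outlier_probes(model, stats)
model(torch.randn(4, 64))
for h in handles:
    h.remove()

for name, records in stats.items():
    mx, mean = records[0]
    # A max/mean "spike ratio" around 50x or more argues for SVD-Quant
    print(f"{name}: spike ratio = {mx / mean:.1f}x")
```

Run this over a few hundred representative prompts and inspect the worst-case ratio per layer, not just the average; a single spiky projection can be enough to sink per-tensor INT8.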

Practical FAQ

Q: Can I combine SVD-Quant and QuaRot? A: Technically, yes, but the complexity-to-performance gain ratio is poor. Rotating the space already tries to solve what SVD-Quant solves via decomposition. Doing both adds unnecessary computational overhead.

Q: Does W8A8 quantization affect the speed of the Prefill or Decode phase more? A: W8A8 significantly improves the Prefill phase (which is compute-bound) by utilizing Tensor Cores more efficiently. For the Decode phase, it helps by reducing the KV cache memory bandwidth pressure, allowing for larger batch sizes.

Q: Is there a significant difference in power consumption? A: Yes. Moving from FP16 to INT8/FP8 compute on modern GPUs can reduce the energy-per-token significantly, as integer operations and low-precision floating point require less toggling of logic gates in the ALU.

Q: What about Llama-3 specifically? A: Llama-3 is known to have more "difficult" activation distributions than Llama-2. Preliminary results from the community suggest that SVD-Quant or SmoothQuant+ (a precursor to the logic in SVD-Quant) is almost mandatory for Llama-3 70B if you want to maintain 8-bit accuracy without fine-tuning.

Gulshan Sharma

AI/ML Engineer, Full-Stack Developer

AI engineer and technical writer passionate about making artificial intelligence accessible. Building tools and sharing knowledge at the intersection of ML engineering and practical software development.