
HQQ vs. AWQ: The Engineering Trade-offs of High-Precision Quantization in Production

Gulshan Sharma
Published on May 10, 2026

If you are deploying Llama 3 70B or Mixtral 8x7B in a production environment, you have already accepted that FP16 is a luxury you cannot afford. To hit the required tokens-per-second (TPS) while keeping VRAM usage within the limits of an A100 or H100 cluster, quantization is mandatory. But the industry has moved past simple Round-To-Nearest (RTN) methods. The real battle for state-of-the-art (SOTA) inference efficiency currently pits AWQ (Activation-aware Weight Quantization) against HQQ (Half-Quadratic Quantization).

While AWQ has become the de facto standard for 4-bit weights due to its integration with vLLM and AutoAWQ, HQQ is rapidly gaining ground as a calibration-free alternative that often outperforms AWQ in both perplexity and flexibility. If you are choosing a quantization strategy for a high-throughput RAG pipeline or a latency-sensitive agentic workflow, you need to understand the underlying mechanics of how these two methods handle weight clipping and error minimization.

Quick Summary

  • AWQ protects the most "salient" 1% of weights based on activation magnitudes. It requires a calibration dataset and is heavily optimized for 4-bit inference through the Marlin and ExLlamaV2 kernels.
  • HQQ treats quantization as a mathematical optimization problem (Half-Quadratic) and requires no calibration data. It is significantly faster to execute, supports sub-4-bit widths (like 2-bit or 3-bit) with higher fidelity, and is ideal for domain-specific models where calibration data is unavailable or sensitive.
  • Recommendation: Use AWQ if you are serving standard 4-bit models through vLLM. Use HQQ if you are working with non-standard bit-widths, require zero-data quantization for privacy, or are quantizing MoE models, where calibration bias can significantly degrade rarely-activated experts.

The AWQ Philosophy: Protecting the 1%

Activation-aware Weight Quantization (AWQ) is built on the observation that weight importance is not uniform. In any given linear layer, a small fraction of weight channels (roughly 1%), identified by the magnitude of the activations flowing through them, contributes disproportionately to the output. If you quantize these "salient" weights poorly, the entire model’s performance collapses.

AWQ does not use traditional fine-tuning. Instead, it searches for an optimal scaling factor for these salient weights. By scaling up these important weights before quantization and scaling down the corresponding activations, AWQ reduces the relative quantization error where it matters most.
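To make the intuition concrete, here is a toy numpy sketch (not the actual AutoAWQ search): one group of weights shares a single 4-bit scale, one channel sees unusually large activations, and scaling that channel's weight up while folding the inverse scale into its activation leaves the full-precision output unchanged but shrinks the quantization error where it matters.

import numpy as np

rng = np.random.default_rng(0)

def quantize_group(w, nbits=4):
    # Round-to-nearest with one shared scale for the whole group,
    # as in group-wise 4-bit weight quantization.
    qmax = 2 ** (nbits - 1) - 1
    step = np.abs(w).max() / qmax
    return np.round(w / step).clip(-qmax, qmax) * step

x = rng.normal(size=128)          # activations feeding a 128-wide weight group
w = rng.normal(size=128) * 0.05   # small weights...
w[0] = 2.0                        # ...plus one outlier that fixes the group scale
x[1] *= 50.0                      # channel 1 is "salient": very large activations

def output_error(s):
    # Scale channel 1 up by s in the weights and fold 1/s into its activation;
    # (x2 @ w2) == (x @ w) in full precision, so only the rounding error changes.
    w2, x2 = w.copy(), x.copy()
    w2[1] *= s
    x2[1] /= s
    return abs(x @ w - x2 @ quantize_group(w2))

print(output_error(1.0))  # plain round-to-nearest
print(output_error(4.0))  # AWQ-style: protect the salient channel (usually much smaller)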

The Calibration Dependency

The primary "gotcha" with AWQ is its reliance on a calibration dataset (typically a subset of WikiText-2 or the Pile). The "salience" of a weight is determined by the activations it produces when processing this data. If your production use case involves highly specialized medical, legal, or financial jargon that differs significantly from the calibration set, the "salience" map will be wrong. This leads to increased perplexity and "hallucination-like" degradation in specialized domains.

When you are fine-tuning open-source LLMs for domain-specific RAG, using a generic AWQ-quantized base model can actually negate the gains of your fine-tuning if the quantization doesn't respect your domain's activation patterns.

The HQQ Approach: Optimization Without Data

Half-Quadratic Quantization (HQQ) takes a fundamentally different path. It treats the search for the quantized weight matrix $W_q$ and the scaling factors as a mathematical optimization problem. Specifically, it decomposes the quantization error into a form that can be solved using the Half-Quadratic solver.

Because HQQ is data-free, it doesn't care about activations. It optimizes the weights to be as close to the original distribution as possible within the constraints of the target bit-width.

Why Data-Free Matters for Engineers

  1. Speed: Quantizing a 70B model with AWQ can take 30–60 minutes depending on your hardware and calibration set size. HQQ can do it in minutes because it only performs local optimizations on the weight matrices.
  2. Privacy: If you are working in a highly regulated environment (e.g., healthcare), you might not have permission to use "representative" data for calibration on a local machine. HQQ bypasses this entirely.
  3. Stability: HQQ is less prone to the "outlier" problem where a specific calibration prompt causes the quantizer to over-index on certain weight channels at the expense of others.

Implementation Guide: AWQ vs. HQQ

Let's look at how you would actually implement these in a Python environment. For AWQ, we typically use AutoAWQ. For HQQ, we use the hqq library.

Quantizing with AWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B"
quant_path = "llama-3-8b-awq"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize - This requires a calibration dataset internally
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
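Once saved, the checkpoint can be loaded back for inference. A minimal sketch, assuming the from_quantized API of a recent AutoAWQ release (argument names can vary between versions, and the model is assumed to land on a CUDA device):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Reload the 4-bit checkpoint; fuse_layers enables AutoAWQ's fused kernels.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

inputs = tokenizer("Quantization lets you", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))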

Quantizing with HQQ

HQQ is much more flexible regarding bit-depth. While AWQ is mostly 4-bit, HQQ handles 2, 3, 4, and 8-bit configurations natively.

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-8B"
# Define 3-bit quantization with a group size of 64
quant_config = BaseQuantizeConfig(nbits=3, group_size=64)

# Load and quantize on-the-fly
model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)

# HQQ is now ready for inference or saving
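Because quantize_model patches the model in place, generation works through the usual transformers interface right away. A minimal sketch, assuming the quantized model sits on a CUDA device:

import torch

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Quantization lets you", return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))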

Performance Benchmarks: Perplexity and Throughput

In production, we care about two things: Perplexity (how well the quantized model still predicts text, i.e., how much of its quality it retains; lower is better) and Throughput (tokens per second).

Perplexity Analysis

In side-by-side tests with Llama-2 and Llama-3, HQQ often shows a slight edge in perplexity at the 4-bit level, but it dominates at the 2-bit and 3-bit levels. This is because AWQ’s reliance on activation salience starts to break down when the "budget" for bits is too low to accurately represent the non-salient weights.

HQQ's optimization-based approach ensures that even at 2-bit, the weights are mathematically optimized to minimize the Frobenius norm of the error. If you are scaling test-time compute and need the smallest possible model footprint to fit more reasoning cycles into your VRAM, HQQ's 3-bit performance is a game-changer.

Throughput and Kernel Support

This is where AWQ currently wins. Because AWQ has been around longer, it has superior kernel support:

  • vLLM Integration: AWQ is a first-class citizen in vLLM. It can run through the Marlin kernel, a highly optimized INT4-weight GEMM kernel for NVIDIA GPUs that approaches FP16 speeds.
  • HQQ Kernels: HQQ originally relied on PyTorch’s dequantization, which was slow. However, with the introduction of the BitBlas and Marlin backends for HQQ, the gap is closing. When using HQQ with the BitBlas backend, you can achieve throughput almost identical to AWQ (see the sketch below).
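As an illustration, recent hqq releases expose a patching helper for switching backends after quantization; treat the helper and backend names below as version-dependent assumptions and check the hqq documentation for your install:

# Assumed API from recent hqq releases -- verify against your installed version.
from hqq.utils.patching import prepare_for_inference

# Replace the default PyTorch dequantization path with the BitBlas kernels.
prepare_for_inference(model, backend="bitblas")

# Alternatively, "marlin" targets the 4-bit Marlin kernels discussed above.
# prepare_for_inference(model, backend="marlin")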

Common Pitfalls and "Gotchas"

1. The "Calibration Shift" in AWQ

If you quantize a model using AWQ on English text and then try to use it for a heavy coding task (Python/C++), you will see a significant drop in logic accuracy compared to the FP16 base. This is because the "salient" weights for natural language are not the same as the "salient" weights for logical syntax. The Fix: Use a mixed calibration set (e.g., 50% Pile, 50% CodeAlpaca) if you go the AWQ route.
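AutoAWQ lets you pass your own calibration samples instead of the default set. A sketch of a mixed calibration set, reusing the model, tokenizer, and quant_config from the AWQ example above and assuming calib_data accepts a list of raw strings (the dataset IDs are just examples):

from datasets import load_dataset

# Roughly 50/50 mix of general English and code instructions.
text = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:512]")["text"]
code = load_dataset("sahil2801/CodeAlpaca-20k", split="train[:256]")["instruction"]
calib_samples = [t for t in text if t.strip()][:256] + list(code)

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_samples,  # assumed to accept a list of raw text strings
)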

2. HQQ and VRAM Overhead

HQQ models saved in the .hqq format are very efficient, but if you are loading them via Hugging Face wrappers without using the specialized backends (like BitBlas), you might find that the peak VRAM during dequantization is higher than you expected. The Fix: Always ensure you are using HQQBackend.BITBLAS or the specialized Marlin kernels for production inference to keep the memory footprint stable.

3. Layer-Wise Sensitivity

Not all layers in a Transformer are created equal. Both AWQ and HQQ allow for group sizes (usually 64 or 128). A smaller group size means higher precision but more VRAM usage for the scales/zeros. Pitfall: Engineers often use a global group size of 128. However, the output projections and the down-projections in the MLP blocks are much more sensitive to quantization error. The Fix: Consider using a group size of 64 for MLP layers and 128 for self-attention layers if your framework allows for heterogeneous quantization.
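With hqq this can be expressed as a per-layer config dictionary keyed by the linear-layer tags. A sketch, assuming your hqq version accepts a dict of BaseQuantizeConfig objects and Llama-style layer names:

from hqq.core.quantize import BaseQuantizeConfig

attn_cfg = BaseQuantizeConfig(nbits=4, group_size=128)  # coarser groups for attention
mlp_cfg = BaseQuantizeConfig(nbits=4, group_size=64)    # finer groups for sensitive layers

# Assumed: quantize_model accepts a dict mapping layer tags to configs.
quant_config = {
    "self_attn.q_proj": attn_cfg,
    "self_attn.k_proj": attn_cfg,
    "self_attn.v_proj": attn_cfg,
    "self_attn.o_proj": mlp_cfg,   # output projection is also sensitive
    "mlp.gate_proj": mlp_cfg,
    "mlp.up_proj": mlp_cfg,
    "mlp.down_proj": mlp_cfg,
}
model.quantize_model(quant_config=quant_config)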

Beyond 4-bit: The Rise of 2-bit and 3-bit Inference

For edge devices or extreme-scale MoE models, 4-bit is sometimes still too heavy. This is where HQQ's mathematical robustness shines. In my experience, a 3-bit HQQ model often outperforms a 4-bit RTN (Round-To-Nearest) model while being 25% smaller.

When you are optimizing LLM inference with speculative decoding, you often need a "draft model" that is extremely fast. Using a 2-bit or 3-bit HQQ-quantized version of your main model as the draft model is an excellent strategy. Since it shares the same vocabulary and similar activation patterns as the base model, the acceptance rate for speculative tokens remains high.
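Transformers' assisted generation makes this straightforward to prototype: the quantized copy is passed as the assistant model. A sketch, assuming target_model and draft_model (the 3-bit HQQ version of the same architecture) are already loaded on the same device and share a tokenizer:

inputs = tokenizer("Explain speculative decoding in one paragraph.",
                   return_tensors="pt").to("cuda")

out = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # the draft proposes tokens; the target verifies them
    max_new_tokens=128,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))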

Technical Deep Dive: The Half-Quadratic Optimization

If you want to understand why HQQ works, you have to look at the objective function. HQQ minimizes:

$$ \min_{Q, S, Z} \| W - S(Q - Z) \|_2^2 $$

Where:

  • $W$ is the original weight matrix.
  • $S$ is the scaling factor.
  • $Q$ is the quantized integer.
  • $Z$ is the zero-point.

Unlike GPTQ, which uses the Hessian (second-order derivative) of the loss, HQQ uses the Half-Quadratic splitting method. It introduces an auxiliary variable that allows the solver to alternate between optimizing the "integer" part and the "floating point" part. This is why it doesn't need data; it’s solving a purely geometric problem in the weight space.
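To build intuition, here is a toy numpy version of an alternating scheme for the objective above. It is plain coordinate descent on $\| W - S(Q - Z) \|_2^2$, not the actual hqq solver (which uses a robust norm and a proximal update for the zero-point), but it shows why no data is needed: every quantity comes from $W$ itself.

import numpy as np

def alternating_affine_quantize(W, nbits=4, iters=20, eps=1e-8):
    # Toy per-row affine quantizer: alternate between snapping to the integer
    # grid and re-fitting (scale, zero-point) by least squares. Illustrative only.
    qmax = 2 ** nbits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    S = (w_max - w_min) / qmax          # initial scale
    Z = -w_min / (S + eps)              # initial zero-point
    for _ in range(iters):
        # (1) Fix S, Z: choose the best integers by rounding onto the grid.
        Q = np.clip(np.round(W / (S + eps) + Z), 0, qmax)
        # (2) Fix Q: re-fit S, Z per row. W ~ a*Q + b with a = S and b = -S*Z.
        q_mean = Q.mean(axis=1, keepdims=True)
        w_mean = W.mean(axis=1, keepdims=True)
        cov = ((Q - q_mean) * (W - w_mean)).mean(axis=1, keepdims=True)
        var = ((Q - q_mean) ** 2).mean(axis=1, keepdims=True)
        a = cov / (var + eps)
        b = w_mean - a * q_mean
        S, Z = a, -b / (a + eps)
    err = np.linalg.norm(W - S * (Q - Z))
    return Q.astype(np.int32), S, Z, err

W = np.random.default_rng(0).normal(size=(64, 256)).astype(np.float32)
Q, S, Z, err = alternating_affine_quantize(W, nbits=3)
print(err)  # reconstruction error after the alternating updates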

Choosing Your Backend: vLLM vs. TGI vs. AutoGPTQ

If your stack is built on vLLM, AWQ is the path of least resistance. The vllm engine has internal C++ kernels for AWQ that are extremely well-vetted.

If your stack is more custom, or if you are deploying on the edge (e.g., using llama.cpp or custom Triton kernels), HQQ offers a cleaner mathematical path. Note that AutoGPTQ does not support HQQ natively yet, so you will need to use the hqq library’s own integration tools.

Practical FAQ

Can I fine-tune a model after it has been quantized with HQQ or AWQ?

Technically, you can use QLoRA (Quantized Low-Rank Adaptation) on both. However, AWQ-quantized models are "warped" by the calibration data. If you use QLoRA on an AWQ model with a very different dataset than the one used for calibration, you might encounter gradient instability. HQQ is generally more stable for post-quantization fine-tuning because the weight distribution hasn't been biased toward a specific calibration set.

How do HQQ and AWQ handle MoE (Mixture of Experts) architectures?

MoE models like Mixtral are notoriously difficult to quantize because "expert" weights are only used sporadically. AWQ struggles here because it's hard to get a "representative" calibration set that activates all experts equally. HQQ is vastly superior for MoEs because it quantizes each expert independently and mathematically, ensuring that rarely-used experts aren't destroyed by calibration bias.

Is there a significant latency difference between 3-bit and 4-bit?

In theory, yes. In practice, it depends on the kernel. Most modern GPUs are optimized for 4-bit (INT4) or 8-bit (INT8) operations. 3-bit quantization often requires "bit-packing," where 10 elements are packed into a 32-bit integer. The overhead of unpacking can sometimes eat up the gains from reduced memory bandwidth. Only use 3-bit if you are VRAM-constrained; for pure speed, 4-bit with a Marlin kernel is usually the sweet spot.
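As a concrete illustration of that packing overhead, here is a toy example of cramming ten 3-bit values into a single 32-bit word and unpacking them again (real kernels do this on the GPU with vectorized shifts and masks):

vals = [5, 7, 0, 3, 1, 6, 2, 4, 7, 1]  # ten 3-bit values (0..7)

# Pack: ten 3-bit fields use 30 of a 32-bit word's bits; 2 bits are wasted.
packed = 0
for i, v in enumerate(vals):
    packed |= (v & 0b111) << (3 * i)
assert packed < 2**32

# Unpack: shift and mask each field back out -- this per-element work is the
# overhead that can eat into the memory-bandwidth savings of 3-bit weights.
unpacked = [(packed >> (3 * i)) & 0b111 for i in range(10)]
print(unpacked)  # [5, 7, 0, 3, 1, 6, 2, 4, 7, 1]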

Next Steps

If you are starting a new project today:

  1. Benchmarking: Start with 4-bit AWQ as your baseline. It is the most "production-ready" in terms of tooling.
  2. Edge Cases: If your perplexity is too high or your data is too "weird" for standard calibration sets, switch to HQQ.
  3. Optimization: If you need to fit a model into a specific VRAM footprint (like fitting a 30B model on a single 24GB consumer card), look at HQQ 3-bit with BitBlas.

Quantization is no longer a "one-size-fits-all" step. By choosing the right method—AWQ for standard deployments and HQQ for high-precision or data-sensitive tasks—you can significantly reduce your inference costs without sacrificing the intelligence of your system. For more on high-scale deployment strategies, check out our guide on implementing multi-agent orchestration frameworks to see how quantized models perform in complex autonomous workflows.

Gulshan Sharma

AI/ML Engineer, Full-Stack Developer

AI engineer and technical writer passionate about making artificial intelligence accessible. Building tools and sharing knowledge at the intersection of ML engineering and practical software development.