Why FP8 Choice is the Difference Between 2x Throughput and Training Collapse

Title: Why FP8 Choice is the Difference Between 2x Throughput and Training Collapse
Slug: fp8-e4m3-vs-e5m2-precision-comparison
Category: Machine Learning
MetaDescription: Stop guessing which FP8 format to use. Learn why E4M3 is for weights and E5M2 is for gradients, and how it impacts your H100/H200 throughput.

I spent three weeks debugging a gradient explosion in a large-scale MoE model that wasn't a learning rate issue, a weight initialization problem, or a data corruption bug. It was an exponent bias mismatch in our FP8 configuration. We were pushing for maximum throughput on H100s, but by using the "wrong" flavor of 8-bit floating point for the backward pass, we effectively turned our gradients into noise. If you think "FP8 is just FP8," you’re likely leaving 40% of your hardware’s potential on the table or, worse, shipping a model that will hallucinate because its weights are fundamentally brittle.

TL;DR / Quick Takes

FP8-E4M3 is for precision. Use it for forward pass weights and activations where you need to represent small differences accurately.
FP8-E5M2 is for range. Use it for gradients in the backward pass to avoid the overflow death spiral: E4M3 tops out at 448 and has no infinity encoding, so anything larger saturates or turns into NaN.
Hardware support is non-negotiable. You need NVIDIA Hopper (H100/H200), Ada Lovelace (L40S, RTX 4090), or Blackwell architectures; trying to emulate this on A100s is a waste of time.
The Scaling Factor (Amax) is the real hero. Without dynamic scaling, both formats fail. It’s not just about the bits; it’s about how you shift the window.

The Anatomy of an 8-Bit Tug-of-War

When we moved from FP32 to FP16 (and eventually BF16), the trade-offs were simple: FP16 gave up dynamic range and precision for memory bandwidth, and BF16 bought the FP32 range back by giving up even more mantissa bits. But with the jump to FP8, we don't have enough bits to be "general purpose." We had to split the format into two distinct children: E4M3 and E5M2.

The names tell you the story. In E4M3, you have 1 sign bit, 4 bits for the exponent, and 3 bits for the mantissa (the fractional part). In E5M2, you give up one bit of precision (leaving only 2 for the mantissa) to gain a much larger exponent range.

The Bit Breakdown

Format	Sign	Exponent	Mantissa	Max Value (Approx)	Precision	Best For
FP8-E4M3	1	4	3	448	Higher	Weights, Activations
FP8-E5M2	1	5	2	57,344	Lower	Gradients
BF16	1	8	7	3.39e38	High	Master Weights

Think of it like this: E4M3 is a high-resolution camera with a narrow field of view. E5M2 is a low-resolution camera with a wide-angle lens. If you try to use the wide-angle lens (E5M2) for your model weights, you lose the subtle nuances that allow a model to differentiate between "cat" and "car." If you use the high-res camera (E4M3) for gradients, you’ll constantly "clip" the edges because gradients often swing wildly in magnitude.
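The numbers in the table fall straight out of the bit layouts. Here is a minimal sketch that derives them, assuming the usual conventions: E5M2 reserves its all-ones exponent for inf/NaN the way IEEE formats do, while E4M3 only gives up a single bit pattern for NaN. The Fp8Format class and its field names are made up for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fp8Format:
    name: str
    exp_bits: int
    man_bits: int
    # True: the all-ones exponent is reserved for inf/NaN (IEEE style, E5M2).
    # False: only the single all-ones bit pattern is NaN (the E4M3 convention).
    ieee_specials: bool

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_value(self) -> float:
        top_exp_field = 2 ** self.exp_bits - (2 if self.ieee_specials else 1)
        # Largest usable mantissa: all ones for E5M2, all ones minus one for E4M3.
        top_man_field = 2 ** self.man_bits - (1 if self.ieee_specials else 2)
        significand = 1 + top_man_field / 2 ** self.man_bits
        return significand * 2.0 ** (top_exp_field - self.bias)

    @property
    def min_subnormal(self) -> float:
        # Smallest positive value before everything rounds to zero.
        return 2.0 ** (1 - self.bias - self.man_bits)


E4M3 = Fp8Format("E4M3", exp_bits=4, man_bits=3, ieee_specials=False)
E5M2 = Fp8Format("E5M2", exp_bits=5, man_bits=2, ieee_specials=True)

for fmt in (E4M3, E5M2):
    print(fmt.name, fmt.max_value, fmt.min_subnormal)
```

The max values come out as 448 and 57,344, matching the table. The smallest subnormals (2^-9 for E4M3, 2^-16 for E5M2) are the other end of the window, which matters once underflow enters the picture.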

Why E4M3 is the King of Inference

In production inference, throughput is the only metric that keeps the CFO happy. When we talk about optimizing MoE models for resource-efficient inference, FP8 is usually the silver bullet.

E4M3 is almost always the choice here. Why? Because activations and weights are generally well-behaved. They tend to cluster around zero (if you're using proper normalization). The 3-bit mantissa in E4M3 provides 8 "steps" of precision between each power of two, compared to only 4 "steps" in E5M2. In LLMs, those extra steps are the difference between a coherent sentence and gibberish.
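To make the "steps" concrete, here is a small sketch that lists every representable value between 1 and 2 for each mantissa width and rounds a sample weight onto both grids. The helper names are hypothetical, and the rounding helper only handles positive, in-range normal values.

```python
import math


def values_between_one_and_two(man_bits: int) -> list[float]:
    """Every representable value in [1, 2) for a given mantissa width."""
    steps = 2 ** man_bits
    return [1 + m / steps for m in range(steps)]


def round_to_mantissa(x: float, man_bits: int) -> float:
    """Round a positive, in-range value to the nearest point on the grid."""
    _, e = math.frexp(x)                 # x = m * 2**e with m in [0.5, 1)
    step = 2.0 ** (e - 1 - man_bits)     # grid spacing inside x's binade
    return round(x / step) * step


print(values_between_one_and_two(3))   # E4M3: 8 values
print(values_between_one_and_two(2))   # E5M2: 4 values
print(round_to_mantissa(1.1, 3), round_to_mantissa(1.1, 2))
```

A weight of 1.1 lands on 1.125 in E4M3 but collapses to 1.0 in E5M2. Multiply that kind of error across billions of weights and the quality gap stops being theoretical.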

Honestly, I’ve seen people try to use E5M2 for inference because they were lazy with their scaling logic and wanted the "safety" of the larger exponent. Don't do that. You’re effectively turning your $30k GPU into a $5k one by sacrificing model quality for a lack of proper quantization code.

The Training Reality: Why Gradients Need E5M2

Training is a different beast. During the backward pass, the chain rule involves multiplying many small partial derivatives. These numbers can get very small—or very large—very quickly.

When I was fine-tuning open-source LLMs for domain-specific RAG, we experimented with pure E4M3 training to simplify the pipeline. It was a disaster. The gradients frequently hit the "448" limit of E4M3 and saturated. Once your gradients saturate, your model stops learning. It just vibrates in place.

E5M2, with its max value of 57,344, gives you the breathing room needed for those gradient spikes. It’s the "safety net" of the FP8 world.
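A toy illustration of that saturation, using made-up, unscaled gradient values and a cast that simply clamps to the format's range. Rounding is ignored on purpose so that only the clipping shows up; real pipelines apply a scaling factor first, which shifts the window but doesn't widen it.

```python
E4M3_MAX = 448.0
E5M2_MAX = 57344.0


def saturating_cast(values, fp8_max):
    """Clamp to the format's range; rounding is ignored to isolate clipping."""
    return [max(-fp8_max, min(fp8_max, v)) for v in values]


def saturated_fraction(values, fp8_max):
    """Share of values whose magnitude exceeds the format's maximum."""
    return sum(abs(v) > fp8_max for v in values) / len(values)


# Hypothetical unscaled gradient values with a couple of spikes.
grads = [0.003, -1.5, 300.0, -2000.0, 40000.0]

print(saturating_cast(grads, E4M3_MAX))
print(saturated_fraction(grads, E4M3_MAX), saturated_fraction(grads, E5M2_MAX))
```

In E4M3 the two spikes both come back as 448 in magnitude, so the optimizer can no longer tell a 2,000 from a 40,000. In E5M2 both survive.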

Implementing FP8 in the Real World

You don't just cast a tensor to torch.float8_e4m3fn or torch.float8_e5m2. (Well, you can, but it’ll suck.) You need a library that handles the scaling factors. NVIDIA's TransformerEngine (TE) is the current industry standard.

Here is what a simplified FP8 linear layer looks like when you're actually trying to balance these two:

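import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# This is the "Recipe" - where the magic happens
# We tell TE to use E4M3 for the forward pass and E5M2 for the backward pass
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
    amax_history_len=16,              # how many past amax values to remember
    amax_compute_algo="max",          # scale from the largest amax in that window
)
# Older TE releases also took interval=1 here; newer ones deprecate it,
# so check the recipe docs for the version you have installed.

# A standard Linear layer wrapped in TE (FP8 execution needs a supported GPU)
model = te.Linear(
    in_features=4096,
    out_features=4096,
    bias=True,
).cuda()

# Placeholder data so the snippet runs end to end; swap in your real batch.
# FP8 GEMMs want dimensions divisible by 16, hence the batch size.
input_tensor = torch.randn(16, 4096, device="cuda")
target = torch.randn(16, 4096, device="cuda")
criterion = torch.nn.MSELoss()

# During the training loop
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input_tensor)
    loss = criterion(output, target)

# Backprop runs outside the autocast block; TE uses E5M2 for the gradients
loss.backward()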

⚠️ Gotcha: The Scaling Factor Lag

The biggest pitfall in FP8 training is the "Delayed Scaling" logic. To save compute, we don't calculate the maximum absolute value (amax) of a tensor before every cast. Instead, the scale is derived from amax values recorded in previous iterations (TE keeps a rolling history of them).

If your model is in a highly volatile part of the loss landscape (like the first 100 steps of training), the amax from step 10 might be 0.5, but the actual values in step 11 might jump to 5.0. Because you’re still using the old scale, everything above the old amax saturates at the format's maximum. I usually keep the scaling interval at 1 for the first 500 steps, even though it costs a bit of performance, then crank it up once the model stabilizes. (Newer TransformerEngine releases deprecate the interval argument, as far as I can tell; there, the knobs to reach for are amax_history_len and amax_compute_algo.)
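Here is the 0.5 to 5.0 jump worked through as arithmetic. This is a simplified sketch, not TE's actual implementation: the scale is assumed to be the FP8 maximum divided by the largest remembered amax, rounding is ignored, and both helper names are hypothetical.

```python
E4M3_MAX = 448.0


def delayed_scale(amax_history, fp8_max=E4M3_MAX):
    """Scale that maps the largest remembered amax onto the FP8 maximum."""
    return fp8_max / max(amax_history)


def cast_and_restore(value, scale, fp8_max=E4M3_MAX):
    """Scale, clamp to the FP8 range, then undo the scale (rounding ignored)."""
    clamped = max(-fp8_max, min(fp8_max, value * scale))
    return clamped / scale


stale_scale = delayed_scale([0.5])          # amax observed at step 10
print(stale_scale)
print(cast_and_restore(5.0, stale_scale))   # step 11 value under the stale scale
print(cast_and_restore(5.0, delayed_scale([0.5, 5.0])))  # after history catches up
```

Under the stale scale of 896, the 5.0 gets multiplied to 4,480, clamps at 448, and comes back out as 0.5. You lose a full 10x of magnitude for one step, and every value above 0.5 in that tensor gets flattened to the same number.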

The Part Nobody Tells You: The Conversion Tax and Kernel Fusion

Everyone looks at the theoretical TFLOPS of an H100 (which are insane for FP8), but they forget about the "conversion tax."

Converting a BF16 tensor to FP8 isn't free. If you do it naively in PyTorch, you might actually see slower training than pure BF16 because you're spending all your time moving data between different memory formats. To actually get near the 2x that the FP8 Tensor Cores promise, you need Kernel Fusion: the cast has to happen inside the GEMM or normalization kernel instead of as a separate pass over memory.

In production, we don't just use FP8 for the math; we use it to reduce memory pressure on the KV-cache. If you're speeding up LLMs with speculative decoding, keeping your KV-cache in E4M3 allows you to fit significantly larger batch sizes. But here’s the kicker: FP8 KV-caches are incredibly sensitive to outliers. If one "token" has a massive activation value, it drags the shared scaling factor down and squashes the precision of every other token quantized with that scale.
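The outlier problem is easy to reproduce on paper. The sketch below uses a hand-rolled E4M3 rounding function (an approximation for illustration, not a library call) and made-up cache values, then compares quantizing an ordinary token with a scale shared with an outlier token against giving it a scale of its own.

```python
import math

E4M3_MAX = 448.0
MIN_NORMAL = 2.0 ** -6        # smallest normal E4M3 magnitude
SUBNORMAL_STEP = 2.0 ** -9    # spacing of E4M3 subnormals


def to_e4m3(x: float) -> float:
    """Round to the nearest E4M3 value, saturating at +/-448."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    if mag < MIN_NORMAL:
        return sign * round(mag / SUBNORMAL_STEP) * SUBNORMAL_STEP
    _, e = math.frexp(mag)            # mag = m * 2**e with m in [0.5, 1)
    step = 2.0 ** (e - 1 - 3)         # 3 mantissa bits -> 8 steps per binade
    return sign * round(mag / step) * step


def roundtrip(values, amax):
    """Quantize with one shared scale derived from amax, then dequantize."""
    scale = E4M3_MAX / amax
    return [to_e4m3(v * scale) / scale for v in values]


def max_rel_error(values, restored):
    return max(abs(v - r) / abs(v) for v, r in zip(values, restored))


# Hypothetical cache entries: one token with an outlier, one ordinary token.
outlier_token = [2000.0, 1.0, -0.5]
normal_token = [0.02, -0.01, 0.003]

shared_amax = max(abs(v) for v in outlier_token + normal_token)
own_amax = max(abs(v) for v in normal_token)

shared_err = max_rel_error(normal_token, roundtrip(normal_token, shared_amax))
own_err = max_rel_error(normal_token, roundtrip(normal_token, own_amax))
print(f"shared scale: {shared_err:.3f}, per-token scale: {own_err:.3f}")
```

With the shared scale, the smallest entry of the ordinary token rounds all the way to zero. With its own scale, the worst-case error stays in the single-digit percent range. Finer-grained scaling costs extra metadata, so it's a trade-off rather than a free fix.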

What I'd actually use in production:

For Inference: Pure E4M3 for weights and activations. Use per-channel scaling for weights and per-tensor scaling for activations.
For Training: The Hybrid approach. E4M3 for the forward pass, E5M2 for the backward pass. Keep your "master weights" in BF16. Don't even think about 8-bit master weights yet; the tech isn't stable enough for production reliability.

Comparisons: FP8 vs. The World

Feature	FP16/BF16	FP8 (Hybrid)	INT8
Throughput	Baseline (1x)	Up to ~2x (peak Tensor Core rate)	Up to ~2x (same peak as FP8 on H100)
Training Stability	Rock Solid	Fragile (requires tuning)	Nearly Impossible
Ease of Use	Plug and Play	Requires Scaling Logic	Requires Calibration
Accuracy Loss	Negligible	<1% with scaling	Variable (2-5%)

The "Scaling Margin" Secret

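Look, I'll be honest: setting the scaling margin is more of an art than a science. TransformerEngine defaults to a margin of 0. The margin is a power-of-two back-off on the scale (the scale gets divided by 2^margin), so a positive margin buys headroom against amax spikes at the cost of pushing small values closer to the "rounding to zero" problem that haunts E4M3. In models with deep residual stacks (like 70B+ LLMs), treat it as a trade-off to tune against both your overflow and your underflow rates, not as a fixed rule.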

When values get too small, E4M3 just rounds them to zero. This is called "underflow." If too many activations underflow, your model effectively "dies"—it’s still running, but the signals aren't propagating. If you see your loss curve go perfectly flat, check your underflow rates before you touch your learning rate.
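Checking the underflow rate doesn't need anything fancy. Here is a minimal sketch, assuming you can dump a sample of a tensor's values and the scale applied to it; the function name and the sample activations are made up for illustration.

```python
E4M3_MAX = 448.0
E4M3_MIN_SUBNORMAL = 2.0 ** -9


def underflow_rate(values, scale, min_subnormal=E4M3_MIN_SUBNORMAL):
    """Fraction of nonzero inputs that round to zero after scaling."""
    nonzero = [v for v in values if v != 0.0]
    if not nonzero:
        return 0.0
    # Anything at or below half the smallest subnormal rounds to zero.
    flushed = sum(abs(v * scale) <= min_subnormal / 2 for v in nonzero)
    return flushed / len(nonzero)


# Hypothetical activations; the scale maps their amax (0.5) onto 448.
activations = [0.5, 0.02, 1e-4, 3e-7, 0.0, -2e-7]
scale = E4M3_MAX / max(abs(v) for v in activations)

print(underflow_rate(activations, scale))
```

Exact zeros are excluded on purpose: they were already zero before quantization, so counting them would hide the values FP8 actually killed. Logging this number per layer every few hundred steps is usually enough to catch a dying layer early.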

Practical FAQ

Q: Can I use FP8 on my A100 or RTX 3090? A: No. You can simulate it for research, but the hardware doesn't have the FP8 Tensor Cores. You'll actually run slower because of the software emulation overhead. Stick to BF16 or INT8 on Ampere.

Q: Does E4M3 affect hallucination rates? A: If done poorly, yes. If your scaling factors aren't updated frequently enough, the model loses its ability to distinguish between low-probability tokens. This usually manifests as the model getting "stuck" in a loop or generating repetitive phrases.

Q: Is FP8 better than 4-bit quantization (bitsandbytes)? A: For inference, 4-bit (NF4) is great for saving memory, but FP8 is significantly faster. FP8 is a hardware-native format; 4-bit usually requires a dequantization step during the forward pass which adds latency. If you have the VRAM, FP8 is the throughput king.

Q: Should I use FP8 for LoRA adapters? A: I wouldn't. LoRA adapters are small enough that the memory savings are negligible, but the precision loss in the adapter can disproportionately ruin the fine-tuning. Keep your adapters in BF16 and your base model in FP8.

What to Try Next

If you're ready to move beyond the theory, start by integrating TransformerEngine into a small script and compare the amax distributions of your weights vs. your gradients. You'll quickly see why the E4M3/E5M2 split exists.

Don't just take my word for it—try running a training run with pure E4M3 gradients. Watch the loss curve. When it inevitably explodes or plateaus, switch the backward pass to E5M2. That "Aha!" moment when the loss starts diving again is the best way to understand the necessity of dynamic range in modern AI.

If you're interested in how these precision shifts impact more complex architectures, check out my deep dive on implementing multi-agent orchestration frameworks, where memory management becomes even more critical than raw TFLOPS.

SocialQuote: "Using the wrong FP8 format is like trying to do surgery with a chainsaw. E4M3 for the forward pass precision, E5M2 for the backward pass range. Mix them up, and your model is toast."

KeyStat: Peak FP8 Tensor Core throughput on H100 GPUs is 2x the BF16 rate, so switching from BF16 to a Hybrid FP8 (E4M3/E5M2) configuration can approach a 2x increase in matmul-bound training throughput, with less than a 0.5% impact on final model perplexity.