DeepSpeed ZeRO-3 vs. PyTorch FSDP: Which One Actually Scales on "Shitty" Interconnects?

I spent three weeks debugging a training run that hung every 400 steps on a cluster of A100s that—honestly—had the networking backbone of a wet noodle. We were trying to fine-tune a 70B parameter model on a budget-tier cloud provider where "high-speed interconnect" turned out to be a marketing euphemism for standard 25Gbps Ethernet. While the tutorials all make it look as easy as flipping a boolean flag, the reality is that when your GPUs can't talk to each other faster than you can yell across a room, the choice between DeepSpeed ZeRO-3 and PyTorch FSDP becomes a matter of "will this finish in my lifetime?" rather than "which is slightly faster?"
TL;DR / Quick Takes
- DeepSpeed ZeRO-3 is the king of "I don't have enough VRAM." If you need to offload parameters to CPU or NVMe to fit a massive model, DeepSpeed is your only sane choice.
- PyTorch FSDP (Fully Sharded Data Parallel) is generally faster and more "Pythonic" if your interconnect is even halfway decent, but it's much pickier about how you wrap your model layers.
- On constrained interconnects (10-25Gbps): DeepSpeed often wins because its communication-overlapping logic is more mature, though the configuration is a nightmare of JSON files.
- The "Magic" Number: If your model parameters exceed the aggregate VRAM of your nodes, use ZeRO-3 with Offload. If they fit, but only just barely, use FSDP with sharding_strategy=SHARD_GRAD_OP.
The Mechanics of Sharding (A Buffet Analogy)
Before we look at the code, think of your model like a massive 100-dish buffet. In standard Data Parallel (DP), every single GPU worker has to hold all 100 dishes in their hands (VRAM). Obviously, that's impossible for a 70B model.
In ZeRO-3 and FSDP, we shard. We give Worker A dishes 1-25, Worker B dishes 26-50, and so on. When Worker A needs to compute something using dish 51, it has to scream across the network to Worker C: "Hey, send me dish 51!"
On a cluster with NVLink (high-speed interconnect), Worker C tosses the dish over instantly. On a constrained cluster (standard Ethernet), Worker C has to pack it in a box, mail it, and Worker A waits. This "waiting" is what kills your TFLOPS. Both DeepSpeed and FSDP try to hide this waiting by "prefetching" the next dish while you're still eating the current one.
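To make the prefetching idea concrete, here is a toy timing model, not a measurement. It assumes every layer needs one all-gather of fixed duration before it can compute, and that prefetching overlaps the next layer's gather perfectly with the current layer's compute. The function names and all numbers are made up for illustration.

```python
def step_time_no_overlap(comm_s, compute_s, n_layers):
    """Every layer waits for its shards to arrive, then computes."""
    return n_layers * (comm_s + compute_s)


def step_time_with_prefetch(comm_s, compute_s, n_layers):
    """The first gather is fully exposed. After that, the gather for
    layer i+1 runs while layer i computes, so each overlapped slot
    costs max(comm, compute). The last layer only has compute left."""
    if n_layers == 0:
        return 0.0
    return comm_s + (n_layers - 1) * max(comm_s, compute_s) + compute_s


if __name__ == "__main__":
    # Hypothetical slow link: gathering a layer takes 4x longer than computing it.
    print(step_time_no_overlap(0.4, 0.1, 80))
    print(step_time_with_prefetch(0.4, 0.1, 80))
```

With these made-up numbers, prefetching trims the pass from about 40 s to about 32 s. The takeaway is that prefetching can only hide communication up to the length of the compute it overlaps with; once the gather is slower than the compute, the network sets the pace.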
DeepSpeed ZeRO-3: The Swiss Army Knife (With Too Many Blades)
DeepSpeed has been the industry standard for a while, mostly because it was first to the party with ZeRO (Zero Redundancy Optimizer). ZeRO-3 takes sharding to the extreme: it shards weights, gradients, and optimizer states.
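A rough way to see what each stage buys you is to count bytes per parameter. The sketch below assumes mixed-precision Adam as described in the ZeRO paper: 2 bytes for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states (fp32 master weights, momentum, variance). It ignores activations, buffers, and fragmentation, so treat the output as a lower bound. The function name is hypothetical.

```python
def per_gpu_model_state_gb(n_params, n_gpus, stage):
    """Approximate model-state memory per GPU in GB (1e9 bytes).

    stage 0 = plain data parallel, 1 = shard optimizer states,
    2 = also shard gradients, 3 = also shard weights.
    """
    weights = 2 * n_params
    grads = 2 * n_params
    optim = 12 * n_params
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        weights /= n_gpus
    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        print(stage, per_gpu_model_state_gb(70e9, 16, stage))
```

For a 70B model on 16 GPUs this gives 1120 GB per GPU with no sharding and 70 GB per GPU at stage 3, which is why even ZeRO-3 leaves very little headroom on 80 GB cards before activations are counted.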
What I Like: CPU Offloading
This is the "killer feature" for constrained environments. If your interconnect is slow, you're already taking a performance hit. ZeRO-3 allows you to offload the optimizer states (which take up the most memory) to the system RAM. This is often necessary when fine-tuning open-source LLMs for domain-specific RAG where the model barely fits in VRAM even without the optimizer overhead.
The Code: That Infamous Config
DeepSpeed relies on a JSON config. It's clunky, but it works. Here is what a production-ready "constrained cluster" config looks like for a 70B model. Two settings matter most on slow networks: overlap_comm is crucial, and sub_group_size should not be too small. JSON does not allow inline comments, so those notes live here rather than in the file. The "auto" batch size assumes the Hugging Face Trainer integration, which fills it in; with plain DeepSpeed, set a number.

    {
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
        },
        "offload_param": {
          "device": "cpu",
          "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6
      },
      "gradient_clipping": 1.0,
      "steps_per_print": 10,
      "train_batch_size": "auto",
      "wall_clock_breakdown": false
    }
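⚠️ Gotcha: If you set overlap_comm to true but your buckets are too small, you'll flood your narrow 25Gbps pipe with thousands of tiny messages. This leads to network congestion that can actually make your training slower than ZeRO-2. Strictly speaking, sub_group_size sets the tile size of the optimizer step (it matters most with offload), while stage3_prefetch_bucket_size and reduce_bucket_size set how much data each communication call moves, so keep all three generous. I usually start sub_group_size at 1e9 and tune down only if I hit OOM (Out of Memory).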
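Since JSON has no comment syntax, one way to keep notes next to each knob is to build the config as a Python dict and dump it to disk. This is a minimal sketch using a trimmed version of the config above; the file name ds_config.json and the helper name write_config are assumptions. As far as I know, deepspeed.initialize also accepts a dict for its config argument, so the dump step may be optional in your setup.

```python
import json

zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,                      # overlap communication with compute
        "contiguous_gradients": True,
        "sub_group_size": int(1e9),                # tune down only if you hit OOM
        "stage3_prefetch_bucket_size": int(5e8),   # bigger buckets, fewer messages
    },
    "gradient_clipping": 1.0,
}


def write_config(path="ds_config.json"):
    """Serialize the dict to a plain, comment-free JSON file."""
    with open(path, "w") as f:
        json.dump(zero3_config, f, indent=2)
    return path


if __name__ == "__main__":
    print(json.dumps(zero3_config, indent=2))
```

The comments stay in version control with the Python file, and the generated JSON stays valid for any tool that parses it strictly.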
PyTorch FSDP: The Native Contender
FSDP is PyTorch’s answer to DeepSpeed. Since it's native, it feels much more like writing standard PyTorch. You don't have a separate JSON; you wrap your model in a Python class and pass the options as constructor arguments.
The Performance Edge
In my experience, FSDP has a slightly more efficient implementation of the All-Gather and Reduce-Scatter collectives. When we were optimizing MoE models for resource-efficient inference, we found that FSDP’s backward_prefetch policy was smarter at predicting which shards were needed next compared to DeepSpeed's more static approach.
The Code: Implementation
    import functools

    import torch
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        MixedPrecision,
        BackwardPrefetch,
        ShardingStrategy,
    )
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer

    # This is where people mess up. You MUST define a wrap policy.
    # If you don't, FSDP treats the whole model as one giant unit,
    # gathers every parameter at once, and peak memory barely improves.
    my_auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={
            LlamaDecoderLayer,  # Replace with your model's layer class
        },
    )

    # Assumes the process group is initialized and base_model is already loaded.
    model = FSDP(
        base_model,
        auto_wrap_policy=my_auto_wrap_policy,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # This is ZeRO-3 equivalent
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # Overlap comms
        device_id=torch.cuda.current_device(),
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    )
⚠️ Gotcha: FSDP is notorious for being "fragile" during checkpointing. If you're on a constrained cluster, saving a 140GB checkpoint (for a 70B model in fp16) over a slow network can time out the distributed group. You often have to use StateDictType.SHARDED_STATE_DICT to save individual rank shards, then stitch them together later. It’s a pain.
The Interconnect Constraint: Where the Rubber Meets the Road
When you are on a cluster with no NVLink (e.g., AWS g5 instances or generic local servers), the "All-Gather" operation becomes your primary bottleneck.
Benchmarking the Lag
| Metric | DeepSpeed ZeRO-3 | PyTorch FSDP |
|---|---|---|
| Mem Efficiency | Excellent (with NVMe offload) | Good (CPU offload only) |
| Ease of Use | Moderate (JSON hell) | High (Native PyTorch) |
| Throughput (10Gbps) | Higher (Better overlapping) | Lower (Higher overhead) |
| Throughput (100Gbps+) | Lower | Higher |
| Stability | Rock solid | Occasional "NCCL Timeout" |
If you're training on a budget, you're likely dealing with high latency. DeepSpeed's ZeRO-3 allows for "Communication Bucketing." It groups small parameter transfers into larger chunks. On a slow interconnect, this is the difference between 5% GPU utilization and 40% GPU utilization.
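The effect of bucketing can be sketched with the classic latency-plus-bandwidth cost model: every message pays a fixed startup cost, then the payload moves at link speed. This is a simplification (real collectives are ring or tree based and NCCL has its own chunking), and the 100 microsecond per-message latency below is an assumed number, not a measurement.

```python
def transfer_time_s(total_bytes, bucket_bytes, latency_s, bandwidth_gbps):
    """Time to move total_bytes in messages of at most bucket_bytes each.

    Each message pays latency_s once; the payload itself takes
    bits / bandwidth regardless of how it is split.
    """
    n_messages = -(-total_bytes // bucket_bytes)  # ceiling division
    wire_time = total_bytes * 8 / (bandwidth_gbps * 1e9)
    return n_messages * latency_s + wire_time


if __name__ == "__main__":
    one_gb = 10**9
    # Same 1 GB of parameters over 25 Gbps, two bucket sizes.
    print(transfer_time_s(one_gb, 64 * 1024, 1e-4, 25))   # many tiny messages
    print(transfer_time_s(one_gb, 5 * 10**8, 1e-4, 25))   # two large buckets
```

Under these assumptions the tiny-bucket case takes roughly 1.85 s against roughly 0.32 s for large buckets, and nearly all of the difference is per-message overhead rather than bandwidth.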
FSDP has similar functionality, but I've found it harder to tune. If you aren't careful, FSDP will try to be "too smart" and prefetch weights aggressively, and if the network can't keep up, the GPUs just sit idle while the CPU thread waits for NCCL to finish.
What I’d Actually Use in Production
Look, I'll be honest — if I'm on a "pro-sumer" cluster (RTX 4090s or A6000s without NVLink bridges), I use DeepSpeed ZeRO-3.
Why? Because things go wrong. DeepSpeed has better logging for when the network hangs. It tells you exactly which rank is lagging. FSDP often just gives you a generic RuntimeError: [NCCL Work Timeout]. When you're 48 hours into a 72-hour run, you want the tool that helps you debug the failure, not just the one that’s 5% faster in a clean lab setting.
However, if you're building a pipeline that needs to be future-proof and you expect to eventually migrate to H100s with InfiniBand, start with FSDP. The transition from DistributedDataParallel (DDP) to FSDP is much smoother than integrating the whole DeepSpeed ecosystem.
The Part Nobody Tells You: The "All-Gather" Wall
There is a point where no amount of software optimization can save you. If your model is so large that every forward pass requires moving 50GB of data across a 10Gbps link, you are spending 40 seconds on communication for every 1 second of computation.
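The arithmetic behind that figure is short enough to check yourself. The sketch below assumes the link runs at its full advertised rate with no protocol overhead and no overlap with compute, which is optimistic; the function names are hypothetical.

```python
def comm_seconds(gigabytes, link_gbps):
    """Seconds to push a payload through a link at line rate (GB -> gigabits)."""
    return gigabytes * 8 / link_gbps


def exposed_comm_fraction(comm_s, compute_s):
    """Share of the step spent waiting on the network, assuming no overlap."""
    return comm_s / (comm_s + compute_s)


if __name__ == "__main__":
    for gbps in (10, 25, 100):
        t = comm_seconds(50, gbps)
        print(gbps, t, exposed_comm_fraction(t, 1.0))
```

At 10 Gbps the 50 GB payload takes 40 seconds, so with 1 second of compute the GPUs wait for about 98% of the step. Even at 100 Gbps it is still 4 seconds, which is why overlap and bucketing only help up to a point.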
In these cases, ZeRO-3/FSDP is actually the wrong choice. You should look into Pipeline Parallelism (PP).
Wait, why? Because PP only communicates at the boundaries between stages (contiguous groups of layers). Instead of sharing everything all the time, Worker A finishes layers 1-10 and just sends the activations (a few megabytes) to Worker B. It’s much more "network-friendly." The downside? It’s much harder to implement without introducing "bubbles" where GPUs sit idle. But on a cluster with terrible interconnects, a 20% "bubble" is better than a 90% "communication wait."
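Both halves of that trade-off can be estimated on the back of an envelope. The sketch below computes the size of the activation tensor handed between two stages and the idle fraction of a GPipe-style schedule, which is (p - 1) / (m + p - 1) for p stages and m micro-batches. The shapes and counts are hypothetical, chosen only to illustrate the orders of magnitude.

```python
def activation_bytes(micro_batch, seq_len, hidden, bytes_per_elem=2):
    """Size of the hidden-state tensor sent across one stage boundary."""
    return micro_batch * seq_len * hidden * bytes_per_elem


def bubble_fraction(n_stages, n_microbatches):
    """Idle share of a GPipe-style pipeline: (p - 1) / (m + p - 1)."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


if __name__ == "__main__":
    # Hypothetical: 1 sequence of 512 tokens, hidden size 8192, 16-bit values.
    print(activation_bytes(1, 512, 8192) / 2**20, "MiB per boundary")
    # Hypothetical: 4 pipeline stages fed with 12 micro-batches.
    print(bubble_fraction(4, 12))
```

With these numbers each boundary moves 8 MiB per micro-batch and the bubble is 20%. Adding more micro-batches shrinks the bubble, at the cost of more activation memory held in flight.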
This is particularly true when you are optimizing RLHF vs. DPO pipelines where you might have four different models (Actor, Critic, Reference, Reward) living on the same cluster.
Practical FAQ
1. Can I mix DeepSpeed and FSDP?
Absolutely not. They both try to hijack the same underlying NCCL (Nvidia Collective Communications Library) process group. Pick one and stick with it. If you're using a library like Hugging Face accelerate, it makes switching between them a one-line config change, which I highly recommend for testing.
2. My training is slow, but I have 100Gbps networking. Why?
Check your CPU. In ZeRO-3/FSDP, the CPU is responsible for managing the "sharding" logic and orchestrating the transfers. If you have a weak CPU (common in some "GPU-heavy" cloud instances), your CPU might be the bottleneck, not the network. Also, check if you've enabled pin_memory: true in your DeepSpeed config; without it, data transfers from CPU to GPU are painfully slow.
3. Does ZeRO-3 work with LoRA?
Yes, but it's often overkill. If you're doing PEFT/LoRA fine-tuning, your optimizer states are tiny. You're usually better off using ZeRO-2 or even standard DDP, as the communication overhead of ZeRO-3 will outweigh the memory savings.
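To see how small "tiny" is, count the parameters LoRA actually trains: each adapted weight matrix of shape d_in x d_out gets two low-rank factors, adding rank * (d_in + d_out) trainable values. The layer count, matrix shapes, and rank below are hypothetical, and the 12 bytes per parameter assumes fp32 master weights plus Adam's two moments.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
    return rank * (d_in + d_out)


def optimizer_state_gb(n_trainable, bytes_per_param=12):
    """Optimizer-state size in GB (1e9 bytes) for the trainable parameters."""
    return n_trainable * bytes_per_param / 1e9


if __name__ == "__main__":
    # Hypothetical: 80 layers, two 8192 x 8192 projections adapted per layer, rank 16.
    trainable = 80 * 2 * lora_params(8192, 8192, 16)
    print(trainable, optimizer_state_gb(trainable))
    # Full fine-tuning of 70B parameters, for comparison.
    print(optimizer_state_gb(70_000_000_000))
```

That works out to about 42 million trainable parameters and roughly 0.5 GB of optimizer state, against 840 GB for full fine-tuning. There is simply not much left for ZeRO-3 to shard on the optimizer side.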
4. How do I prevent NCCL Timeouts?
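On constrained clusters, increase your timeout limit. In PyTorch, you can do this when you initialize the process group (this assumes import datetime and import torch.distributed as dist):

    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=5400))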
An hour and a half might seem insane, but when the network gets congested during a massive checkpoint save, you'll be glad you didn't crash.
What to Try Next
If you're stuck on a slow cluster, don't just throw more GPUs at it. Start by profiling your communication-to-computation ratio. Use the PyTorch Profiler or DeepSpeed’s built-in flops profiler. If you find you're spending more than 50% of your time in nccl_allgather, it’s time to either:
- Switch to ZeRO-2 (if VRAM allows).
- Aggressively increase your micro-batch size.
- Look into gradient accumulation to reduce the frequency of communication.
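Here is a small sketch of why the micro-batch option helps. It assumes the per-step communication time is fixed (the same weights get gathered no matter how many samples you push through), that compute scales linearly with micro-batch size, and that nothing overlaps. All timings are placeholders you would replace with numbers from your own profiler trace.

```python
def comm_fraction(comm_s, compute_per_sample_s, micro_batch):
    """Share of step time spent in communication, assuming no overlap."""
    compute_s = compute_per_sample_s * micro_batch
    return comm_s / (comm_s + compute_s)


if __name__ == "__main__":
    # Placeholder profile: 2.0 s of all-gather per step, 0.5 s compute per sample.
    for mb in (1, 2, 4, 8):
        print(mb, round(comm_fraction(2.0, 0.5, mb), 3))
```

With these placeholder numbers, a micro-batch of 1 spends 80% of the step communicating, 4 brings it to the 50% threshold, and 8 gets it down to about a third, provided the larger batch still fits in VRAM.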
Training at scale is rarely about the "best" framework and almost always about managing the hardware constraints you were handed. Happy training.
Gulshan Sharma
AI/ML Engineer, Full-Stack Developer