
Scaling Context to 1M+: Ring Attention vs. DeepSpeed Ulysses in Production

CyberInsist
Published on April 16, 2026


If you are trying to scale a Large Language Model (LLM) to a 128k, 512k, or 1M+ token context window, you have likely hit a wall where standard Tensor Parallelism (TP) and Data Parallelism (DP) simply fall apart. The quadratic $O(N^2)$ scaling of the attention mechanism is the obvious culprit, but the less obvious killer in production is memory fragmentation and interconnect saturation. When your activation memory exceeds the capacity of a single H100 (80GB), you need to shard the sequence dimension itself.

This is where Sequence Parallelism (SP) comes in. Currently, two dominant paradigms have emerged for production-grade long-context training: DeepSpeed Ulysses and Ring Attention. I have spent the last year benchmarking these architectures, and while they both solve the memory problem, their performance profiles under different network topologies (NVLink vs. InfiniBand) and sequence lengths are diametrically opposed.

Quick Summary: Which Should You Use?

If you are looking for a quick heuristic, here is the breakdown based on my production experience:

  • DeepSpeed Ulysses: Best for sequence lengths up to 128k or 256k where the number of attention heads is high. It relies on All-to-All communication, which is incredibly fast on high-bandwidth clusters but hits a hard wall when the Sequence Parallel (SP) degree exceeds the number of attention heads.
  • Ring Attention: Best for "infinite" context (1M+ tokens). It uses Peer-to-Peer (P2P) communication in a circular buffer, overlapping communication with computation. It is more complex to implement correctly but avoids the "head count" bottleneck and scales linearly with more GPUs.
  • Hybrid Approaches: In many high-end setups, we are seeing a combination of both—using Ulysses within a single node (over NVLink) and Ring Attention across nodes.

The Architecture of DeepSpeed Ulysses

DeepSpeed Ulysses is elegantly simple. It shards the input sequence across $P$ GPUs. During the attention computation, it performs an All-to-All collective communication to transpose the data so that each GPU holds all sequence elements, but only for a subset of the attention heads.

How Ulysses Works in the Training Loop

  1. Sequence Sharding: The input sequence of length $N$ is divided into $P$ chunks of $N/P$ tokens, one per GPU.
  2. All-to-All (QKV): Before the attention core, an All-to-All scatter/gather happens. Each GPU now has the full sequence length $N$ but only for $(H/P)$ heads.
  3. Local Attention: Standard attention (like FlashAttention-2) is computed locally. Since the GPU has the full sequence for its assigned heads, no further communication is needed to calculate the attention scores.
  4. All-to-All (Output): After the attention computation, another All-to-All is performed to shard the sequence back across GPUs for the subsequent MLP layers.
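
To make the data movement concrete, here is the shape trace for a hypothetical configuration (131k tokens, $P = 8$ GPUs, 32 heads of dimension 128, layout [seq, batch, heads, head_dim]):

# Step 1 (shard):      q, k, v per GPU: [16384, B, 32, 128]   (N/P tokens)
# Step 2 (All-to-All): q, k, v per GPU: [131072, B, 4, 128]   (H/P = 4 heads)
# Step 3 (local attn): FlashAttention over the full 131072 tokens, 4 heads
# Step 4 (All-to-All): output per GPU:  [16384, B, 32, 128]   (back to N/P)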

The beauty of Ulysses is that it is compatible with virtually any attention implementation. However, the bottleneck is the All-to-All communication. In a cluster, All-to-All requires every GPU to talk to every other GPU in the SP group. As $P$ increases, the number of messages grows quadratically, leading to congestion on anything slower than a dedicated InfiniBand fabric.
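
To see why, count the pairwise transfers per collective (a quick sketch; actual throughput also depends on the fabric and the NCCL algorithm):

# Each All-to-All in an SP group of size P involves P * (P - 1) pairwise sends.
for p in (8, 32, 64):
    print(f"P={p}: {p * (p - 1)} pairwise transfers per All-to-All")
# P=8: 56, P=32: 992, P=64: 4032 -- quadratic in P, whereas Ring Attention's
# P2P scheme (below) talks to exactly 2 neighbors no matter how large P gets.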

The Architecture of Ring Attention

Ring Attention, popularized by the Berkeley team (Liu et al.), takes a different approach. Instead of reshuffling heads, it shards the sequence and keeps the shards in place.

The Circular Buffer Mechanism

In Ring Attention, each GPU calculates attention for its local $Q$ (Query) block against its local $K$ (Key) and $V$ (Value) blocks. Then, it sends its $K$ and $V$ blocks to the next GPU in the ring while receiving $K$ and $V$ blocks from the previous GPU.

This process repeats $P-1$ times until every GPU has seen every $K/V$ block. To make this efficient in production, we use asynchronous P2P communication to overlap the transfer of the next $K/V$ block with the computation of the current attention block.
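
Here is a minimal sketch of that loop structure, assuming the default process group is the ring and two hypothetical helpers: flash_attn_step (one FlashAttention call that also returns its log-sum-exp statistics) and merge_partials (the lse-based combine, shown in the Ring Attention implementation section below):

import torch
import torch.distributed as dist

def ring_attention(q, k, v):
    # q, k, v are this rank's local sequence shards.
    rank, p = dist.get_rank(), dist.get_world_size()
    out, lse = None, None
    for step in range(p):
        if step < p - 1:
            # Post async sends/recvs for the NEXT K/V block; batch_isend_irecv
            # schedules the pairs so the ring cannot deadlock.
            k_next, v_next = torch.empty_like(k), torch.empty_like(v)
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, k, (rank + 1) % p),
                dist.P2POp(dist.isend, v, (rank + 1) % p),
                dist.P2POp(dist.irecv, k_next, (rank - 1) % p),
                dist.P2POp(dist.irecv, v_next, (rank - 1) % p),
            ])
        # ...and overlap: compute attention on the CURRENT block while the
        # transfer is in flight (flash_attn_step is a hypothetical helper).
        block_out, block_lse = flash_attn_step(q, k, v)
        out, lse = ((block_out, block_lse) if out is None
                    else merge_partials(out, lse, block_out, block_lse))
        if step < p - 1:
            for req in reqs:
                req.wait()
            k, v = k_next, v_next
    return out

Note the double buffering (k_next/v_next): it is what makes the overlap possible, at the cost of a second copy of the K/V shard, a trade-off discussed in the pitfalls section below.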

The primary advantage here is that the communication volume per GPU is constant regardless of the SP degree: you only ever talk to your two neighbors. This makes Ring Attention the go-to when you need to push context length to the absolute limit.

Performance Bottlenecks: A Technical Deep Dive

When you're choosing between these two, you need to look at your Arithmetic Intensity and Communication Bandwidth.

1. The "Head Count" Constraint in Ulysses

Ulysses has a hard constraint: Sequence_Parallel_Degree <= Number_of_Heads. If you are training a model with 32 heads, you cannot scale Ulysses beyond 32 GPUs. For GQA (Grouped Query Attention) models, this is even more restrictive. If you have only 8 KV heads, your SP degree for Ulysses is severely limited unless you do complex workarounds.
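
This is easy to guard against in configuration validation. A minimal sketch, using Llama-3-70B-style numbers (64 query heads, 8 KV heads) as the example:

# Hypothetical model config
NUM_HEADS = 64      # query heads (e.g. Llama 3 70B)
NUM_KV_HEADS = 8    # GQA key/value heads
sp_degree = 16

# Ulysses splits heads across the SP group, so the degree cannot exceed
# the head count -- and for GQA, the KV heads are the binding constraint
# unless you replicate them across ranks.
assert sp_degree <= NUM_HEADS, "SP degree exceeds query head count"
assert sp_degree <= NUM_KV_HEADS, (
    f"SP degree {sp_degree} > {NUM_KV_HEADS} KV heads: "
    "replicate KV heads or switch to Ring Attention"
)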

2. Communication Overhead: All-to-All vs. P2P

  • Ulysses (All-to-All): The total data moved is $2 \times \text{bytes per element} \times N \times D$ (where $D$ is model dimension). Because it's an All-to-All, the latency is sensitive to the number of nodes.
  • Ring Attention (P2P): The data moved is similar, but it happens in $P$ smaller steps. The magic is in the overlap. If your computation time (FlashAttention kernel) is longer than your P2P transfer time, the communication cost is effectively hidden (zero-latency).

However, at shorter sequence lengths (e.g., 32k), the FlashAttention kernel is so fast that you cannot hide the P2P transfer. In this scenario, Ulysses usually wins because it finishes its All-to-All faster than the ring can complete its $P-1$ steps.
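
You can sanity-check the overlap condition for your own cluster with back-of-the-envelope numbers. A rough sketch (the throughput and bandwidth figures are illustrative assumptions, not measurements):

# Per ring step: attend local Q [N/P] against one K/V block [N/P],
# while the next K/V block transfers over P2P.
N, P, D = 512_000, 32, 8192           # tokens, SP degree, model dim (assumed)
chunk = N // P
flops = 4 * chunk * chunk * D         # ~QK^T + AV for one chunk pair
compute_s = flops / 400e12            # assume ~400 TFLOPs effective per GPU
kv_bytes = 2 * chunk * D * 2          # K and V in bf16 (2 bytes/element)
transfer_s = kv_bytes / 50e9          # assume ~50 GB/s inter-node per direction
print(f"compute {compute_s*1e3:.1f} ms vs transfer {transfer_s*1e3:.1f} ms")
# -> roughly 21 ms of compute vs 10 ms of transfer: hidden. Compute scales
# with chunk^2 while transfer scales with chunk, so shrinking the sequence
# flips the inequality and exposes the communication.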

Implementing Ulysses: A Simplified Guide

If you're using the DeepSpeed library, Ulysses is relatively easy to enable. Here is the conceptual logic you would implement in your attention forward pass:

import torch
import torch.distributed as dist
from flash_attn import flash_attn_func  # pip install flash-attn

def ulysses_attention(q, k, v, sp_group):
    # q, k, v are sharded along the sequence dimension: [seq/P, batch, heads, dim]

    # 1. All-to-All to shard by heads instead of sequence
    # Resulting shape: [seq, batch, heads/P, dim]
    q = all_to_all_single(q, sp_group, input_dim=0, output_dim=2)
    k = all_to_all_single(k, sp_group, input_dim=0, output_dim=2)
    v = all_to_all_single(v, sp_group, input_dim=0, output_dim=2)

    # 2. Local Attention (using FlashAttention)
    # This is efficient because we have the FULL sequence for these specific
    # heads. NB: flash_attn_func expects (batch, seqlen, heads, dim), so
    # transpose first if you keep the sequence-leading layout shown here.
    attn_output = flash_attn_func(q, k, v, causal=True)

    # 3. All-to-All back to sequence sharding
    # Resulting shape: [seq/P, batch, heads, dim]
    return all_to_all_single(attn_output, sp_group, input_dim=2, output_dim=0)

def all_to_all_single(tensor, group, input_dim, output_dim):
    # Wrapper around torch.distributed.all_to_all_single: the tensor arrives
    # sharded along `input_dim` and leaves sharded along `output_dim`.
    p = dist.get_world_size(group=group)
    # Split the dimension we are about to shard into p chunks; stacking them
    # along a new leading dimension means chunk i is sent to rank i.
    send = torch.stack([t.contiguous() for t in tensor.chunk(p, dim=output_dim)])
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send, group=group)
    # Received chunk i is rank i's shard along `input_dim`; stitch them back.
    return torch.cat(list(recv.unbind(0)), dim=input_dim)
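
Called from a training loop, the wiring looks roughly like this (the group size, shapes, and single-batch setup are illustrative):

# One SP group spanning 8 ranks; in a larger job you would carve this
# out of the global world with dist.new_group(ranks).
dist.init_process_group("nccl")
sp_group = dist.new_group(ranks=list(range(8)))

# Each rank holds its 1/8th of a 128k-token sequence: [16384, batch, heads, dim]
q = torch.randn(16384, 1, 32, 128, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
out = ulysses_attention(q, k, v, sp_group)  # back to [16384, 1, 32, 128]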

When deploying this for fine-tuning open-source LLMs, use a framework like Megatron-LM or DeepSpeed that handles the process group orchestration for you.

Implementing Ring Attention: The Gotchas

Implementing Ring Attention is significantly harder because you have to manage the KV-cache buffers and the cumulative logic for the Softmax normalization. Because you are calculating attention in chunks, you must keep track of the lse (log-sum-exp) values from FlashAttention to merge the results correctly at the end.
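
The merge itself is a numerically careful log-sum-exp combine. Here is a sketch of the merge_partials helper referenced in the ring loop above, assuming you have each block's partial output and its lse row statistics (FlashAttention kernels expose these, though the exact interface varies by version):

import torch

def merge_partials(out1, lse1, out2, lse2):
    # out*: [batch, seq, heads, dim] partial outputs, each normalized
    #       within its own K/V block
    # lse*: [batch, heads, seq] log-sum-exp of that block's attention scores
    lse1, lse2 = lse1.float(), lse2.float()      # accumulate in fp32
    merged_lse = torch.logaddexp(lse1, lse2)
    # Renormalization weights, reshaped to broadcast as [batch, seq, heads, 1]
    w1 = torch.exp(lse1 - merged_lse).transpose(1, 2).unsqueeze(-1)
    w2 = torch.exp(lse2 - merged_lse).transpose(1, 2).unsqueeze(-1)
    merged = w1 * out1.float() + w2 * out2.float()
    return merged, merged_lse

Folding each arriving block into a running (output, lse) pair like this is what lets the ring avoid materializing all $P$ partial results at once.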

The Pitfall: Memory Fragmentation

In Ring Attention, you need to double-buffer your $K$ and $V$ tensors to allow the next chunk to be received while the current one is being processed. This effectively doubles the memory required for the $KV$ shards on each GPU. If you aren't careful, this can trigger Out-Of-Memory (OOM) errors that negate the benefits of sequence parallelism.

The Pitfall: Causal Masking

In a Ring, causal masking is tricky. GPU 0 only sees the first chunk of the sequence, so it can be fully causal. But GPU 5 sees its own chunk and chunks from GPUs 0-4. You have to pass the correct block indices to your attention kernel to ensure tokens don't attend to the "future" tokens in later shards of the ring.
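
In code, this usually reduces to a three-way branch per ring step, where q_block_idx is this GPU's fixed position and kv_block_idx tracks whose K/V block just arrived (a sketch reusing the hypothetical flash_attn_step helper from earlier):

def attend_block(q, k_blk, v_blk, q_block_idx, kv_block_idx):
    if kv_block_idx == q_block_idx:
        # Diagonal block: tokens may only attend to earlier tokens in the chunk.
        return flash_attn_step(q, k_blk, v_blk, causal=True)
    elif kv_block_idx < q_block_idx:
        # K/V chunk lies entirely in the past: full (non-causal) attention.
        return flash_attn_step(q, k_blk, v_blk, causal=False)
    else:
        # K/V chunk lies entirely in the future: contributes nothing, skip it.
        return None

Those skipped steps are also why a naive ring is load-imbalanced (GPU 0 skips almost everything); some implementations reorder the blocks into zigzag or striped layouts to even out the work.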

Real-World Production Metrics

In my testing on an 8x H100 node (NVLink):

  • Sequence Length 64k: Ulysses is ~15% faster than Ring Attention. The All-to-All over NVLink is extremely efficient, and the overhead of setting up the ring steps isn't worth it.
  • Sequence Length 512k: Ring Attention begins to shine. The All-to-All starts to struggle with bus contention, whereas the Ring's P2P communication remains steady and perfectly overlapped with the now-very-long FlashAttention compute time.
  • Sequence Length 1M+: Ring Attention is the only viable option. Ulysses often hits head-count limits or times out on the collective communication.

If you are also optimizing MoE models for efficient inference, you'll find that Ulysses integrates more naturally with Expert Parallelism, as both rely heavily on All-to-All patterns.

The Convergence: Hybrid Sequence Parallelism

The "Senior Engineer" move in 2024/2025 is not choosing one, but using both. This is often called Hybrid Sequence Parallelism.

You use DeepSpeed Ulysses within a single node (intra-node) because All-to-All is lightning-fast over NVLink/NVSwitch. Then, you use Ring Attention across nodes (inter-node) because P2P communication is much more resilient to the higher latency and lower bandwidth of InfiniBand or RoCE compared to NVLink.
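
The topology maps directly onto two process groups per rank. A minimal sketch, assuming 8 GPUs per node and a world size that is a multiple of 8:

import torch.distributed as dist

GPUS_PER_NODE = 8
rank, world = dist.get_rank(), dist.get_world_size()

# Intra-node groups for Ulysses (NVLink): ranks [0..7], [8..15], ...
ulysses_groups = [dist.new_group(list(range(n, n + GPUS_PER_NODE)))
                  for n in range(0, world, GPUS_PER_NODE)]
# Inter-node groups for the Ring (IB/RoCE): same local index on every node
ring_groups = [dist.new_group(list(range(i, world, GPUS_PER_NODE)))
               for i in range(GPUS_PER_NODE)]

my_ulysses_group = ulysses_groups[rank // GPUS_PER_NODE]
my_ring_group = ring_groups[rank % GPUS_PER_NODE]

Every rank must execute every dist.new_group call, which is why both lists are built unconditionally on all ranks.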

This hybrid approach allows you to scale to massive context lengths while keeping the communication overhead manageable.

Practical Common Pitfalls

  1. Ignoring the MLP Layer: Everyone focuses on the Attention layer, but remember that the MLP layer also needs to handle the sharded sequence. If you use Ulysses, the sequence is sharded, then gathered, then sharded again. Make sure your MLP implementation is "sequence-agnostic" to avoid unnecessary reshuffles.
  2. Gradient Accumulation: When using Ring Attention, gradient accumulation can become a synchronization nightmare. Ensure your all_reduce for gradients happens outside the ring logic to avoid deadlocks.
  3. Numerical Stability: When merging attention outputs from different ring chunks using lse, ensure you are using float32 for the accumulation. In bfloat16, the errors from the Softmax scaling can compound over a long ring (e.g., 32 or 64 GPUs), leading to loss divergence.

Wrapping Up

Choosing between Ring Attention and DeepSpeed Ulysses comes down to your hardware and your target context length. If you're building a RAG-heavy system and want to speed up LLMs with speculative decoding at inference time, you first need a model that was trained with consistent sequence parallelism.

Ulysses is your "sprint" tool—fast, efficient, but limited in range. Ring Attention is your "marathon" tool—steady, infinitely scalable, but with a higher overhead for shorter distances. For most production use cases today (128k context), start with Ulysses. If you’re aiming for the 1M token milestone, start building your Ring implementation today.

Practical FAQ

Q: Can I use Ring Attention with Llama 3's Grouped Query Attention (GQA)? A: Yes, absolutely. Unlike Ulysses, Ring Attention doesn't care about the number of heads. It shards the sequence, so it works perfectly with GQA. You just shard the $Q$ heads and the (fewer) $KV$ heads across the ring.

Q: Does Sequence Parallelism increase the total GPU memory usage? A: Theoretically, no. It redistributes the activation memory. However, in practice, the buffers required for communication (like the double-buffering in Ring Attention or the All-to-All workspace in Ulysses) add a small memory overhead (usually 5-10%).

Q: How does Sequence Parallelism interact with Tensor Parallelism (TP)? A: They are orthogonal. You can (and often should) use both. TP shards the weights and the hidden dimension, while SP shards the sequence dimension. In a typical 4D parallelism setup (DP + PP + TP + SP), SP is usually the last dimension you add to handle extreme context lengths.
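
On recent PyTorch versions you can express that layered layout as a device mesh rather than hand-building process groups (a sketch; the dimension sizes are arbitrary):

from torch.distributed.device_mesh import init_device_mesh

# 128 GPUs = 2 (DP) x 2 (PP) x 4 (TP) x 8 (SP); sizes are illustrative.
mesh = init_device_mesh("cuda", (2, 2, 4, 8),
                        mesh_dim_names=("dp", "pp", "tp", "sp"))
sp_group = mesh["sp"].get_group()  # hand this to your SP attention layer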

Q: Which one is better for inference? A: For inference, we usually don't use SP during the "prefill" phase unless the prompt is massive (100k+ tokens). During "decoding," we use KV-caching. Ring Attention logic is being adapted for "Ring KV-Cache" in some serving frameworks, but for most production inference, Ulysses-style head sharding is more common because of the low latency requirements.
