Forget LoRA: A Deep Dive into GaLore vs. BAdam for Full-Parameter LLM Fine-Tuning

If you are still defaulting to LoRA (Low-Rank Adaptation) because you think full-parameter fine-tuning of a 7B or 13B model requires a massive cluster of H100s, you are leaving performance on the table. While Parameter-Efficient Fine-Tuning (PEFT) is a godsend for hobbyists, it often struggles to capture the deep, foundational shifts required for complex domain-specific tasks. The bottleneck isn't actually the model weights; it's the optimizer states. In a standard AdamW setup with BF16 weights, you're looking at roughly 10 bytes of memory per parameter just for the gradients and optimizer states, on top of the model itself.
I have spent the last few months stress-testing two emerging contenders that promise "Full-Parameter" performance with LoRA-like memory footprints: GaLore (Gradient Low-Rank Projection) and BAdam (Block-wise Adam). These aren't just incremental improvements; they represent a fundamental shift in how we handle the backward pass. This post will break down the mechanics, the trade-offs, and the implementation details so you can decide which one to deploy in your next training run.
Quick Summary
| Feature | GaLore | BAdam |
|---|---|---|
| Mechanism | Low-rank projection of gradients | Sequential block-wise optimization |
| Memory Efficiency | High (60-80% reduction in optimizer states) | Very High (Updates only one "block" at a time) |
| Convergence Speed | Near-native AdamW | Slightly slower than AdamW |
| Full Parameter? | Yes (Updates all weights over time) | Yes (Updates all weights sequentially) |
| Primary Use Case | Training 7B models on 24GB consumer GPUs | Training large models on extremely limited VRAM |
The Memory Wall: Why Standard AdamW Fails
To understand why we need GaLore or BAdam, we have to look at the math of memory. When you fine-tune a large language model, GPU memory is consumed by:
- Model Weights: 2 bytes per parameter (FP16/BF16).
- Gradients: 2 bytes per parameter.
- Optimizer States (AdamW): 8 bytes per parameter (for the $m$ and $v$ moving averages, usually stored in FP32).
For a 7B model, that's roughly 14GB for weights, 14GB for gradients, and 56GB for optimizer states. You're at 84GB before you've even calculated a single activation or defined your batch size. This is why everyone uses LoRA: it only updates a fraction of these parameters. But LoRA can be limiting when fine-tuning open-source LLMs for domain-specific RAG, where the model needs to learn entirely new vocabularies or structural logic.
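The arithmetic above is easy to sanity-check with a few lines of Python. The helper below is my own, purely illustrative, using the same per-parameter byte counts as the list above:

```python
def adamw_memory_gb(num_params, weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Memory breakdown for BF16 weights/gradients and FP32 Adam moments (m, v)."""
    billions = num_params / 1e9            # bytes-per-param * billions-of-params = GB
    weights = billions * weight_bytes
    grads = billions * grad_bytes
    optim = billions * optim_bytes
    return weights, grads, optim

w, g, o = adamw_memory_gb(7e9)             # a 7B-parameter model
print(w, g, o, w + g + o)                  # 14.0 14.0 56.0 84.0
```

Note how the optimizer states alone are four times the size of the weights; that is the term GaLore and BAdam attack.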
GaLore: Gradient Low-Rank Projection
GaLore (Gradient Low-Rank Projection) works on a clever observation: while the weight matrix $W$ is full-rank, the gradient $G$ during training typically resides in a low-rank subspace. Instead of storing full-size optimizer states for $G$, GaLore projects the gradient into a low-rank subspace using projection matrices $P$ and $Q$.
How it Works
When you use GaLore, you aren't training an "adapter." You are updating the actual weights of the model. The process looks like this:
- Compute the gradient $G$ for a layer.
- Apply Singular Value Decomposition (SVD) or a projection to find the low-rank subspace.
- Project the gradient: $\tilde{G} = P^T G Q$.
- Update the optimizer states $m$ and $v$ using the projected gradient $\tilde{G}$.
- Project back to the original weight space to update $W$.
Because the optimizer states now only track the low-rank $\tilde{G}$, the memory overhead for $m$ and $v$ drops sharply: the GaLore paper reports optimizer-state memory savings of up to 65.5%. I've found that GaLore is particularly effective because it doesn't freeze any parameters; it just views them through a low-rank lens that shifts over time.
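To make the mechanics concrete, here is a toy single-matrix sketch of the idea (my own code, not the `galore_torch` internals; all names are mine). In practice the implementation typically uses a one-sided projection, $\tilde{G} = P^T G$, with $P$ taken from an SVD of the gradient and refreshed only every few hundred steps:

```python
import torch

def compute_projector(G, rank):
    """Re-computed every `update_proj_gap` steps: top-r left singular vectors of G."""
    U, _, _ = torch.linalg.svd(G, full_matrices=False)
    return U[:, :rank]                                 # (rows, r)

def galore_adam_update(W, G, m, v, P, step, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step where the moments m, v live entirely in the rank-r subspace."""
    G_low = P.T @ G                                    # project: (r, cols)
    m.mul_(beta1).add_(G_low, alpha=1 - beta1)         # low-rank first moment
    v.mul_(beta2).addcmul_(G_low, G_low, value=1 - beta2)
    m_hat = m / (1 - beta1 ** step)                    # bias correction
    v_hat = v / (1 - beta2 ** step)
    # Project the low-rank update back to the full weight space before applying it
    W.sub_(lr * (P @ (m_hat / (v_hat.sqrt() + eps))))
    return W
```

The key point: `m` and `v` are `(rank, cols)` instead of `(rows, cols)`, yet every entry of the full `W` still moves on every step.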
Implementing GaLore
If you're using the official `galore_torch` implementation (its 8-bit variants build on bitsandbytes), you don't need to rewrite your training loop. You just swap the optimizer.
```python
import torch
from galore_torch import GaLoreAdamW8bit

# Apply GaLore only to the 2D weight matrices (usually the Linear layers)
galore_params = []
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        galore_params.append({'params': [module.weight],
                              'rank': 128,
                              'update_proj_gap': 200})

optimizer = GaLoreAdamW8bit(
    galore_params,
    lr=1e-5,
    weight_decay=0.01,
)
```
Pro Tip: The update_proj_gap is crucial. It defines how often the projection matrices $P$ and $Q$ are re-computed via SVD. Set it too high, and the gradient approximation becomes stale. Set it too low, and the SVD overhead will tank your training throughput. I usually start at 200.
BAdam: Block-wise Adam Optimization
If GaLore is about "compression," BAdam is about "segmentation." BAdam (Block-wise Adam) acknowledges that we don't need to calculate the optimizer states for every single parameter at the same time.
How it Works
BAdam partitions the model into "blocks" (usually layers or groups of layers). During a training step, BAdam behaves like a coordinate descent algorithm:
- It selects one block (e.g., layers 1-4) to be "active."
- It performs a forward pass through the whole model.
- During the backward pass, it only calculates and stores gradients/optimizer states for the active block.
- It updates the weights for that block and moves to the next block in the next iteration.
This is fundamentally different from GaLore. BAdam allows you to tune a massive model on a tiny GPU because at any given millisecond, the optimizer is only "thinking" about a small fraction of the model.
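The freezing mechanics behind this can be mimicked in a few lines with plain `requires_grad` toggling; this is a simplification of what the BAdam wrapper automates, and the helper name here is mine:

```python
import torch

def activate_block(model, prefix):
    """Freeze every parameter except those whose name starts with `prefix`.
    Gradients -- and hence Adam states -- then exist only for the active block."""
    for name, p in model.named_parameters():
        p.requires_grad_(name.startswith(prefix))

# Toy 2-block "model"; in a real LLM the prefixes would be e.g. "model.layers.0."
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))

activate_block(model, "0.")            # block 0 is active this phase
loss = model(torch.randn(3, 4)).sum()
loss.backward()                        # backprop still traverses block 1,
                                       # but no gradient tensor is stored for it
```

Cycling `activate_block` over the prefixes in successive phases gives you the coordinate-descent behavior described above.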
Implementing BAdam
BAdam is implemented as a wrapper around a base optimizer: you hand it your model's named parameters and it handles the block switching. You choose the switching schedule based on your VRAM budget and how often you want each block revisited.

```python
import torch
from badam import BlockOptimizer

# BAdam wraps a base optimizer and cycles the active block for you.
# By default, blocks are inferred from the transformer layer prefixes
# (e.g. "model.layers.N" for Llama/Mistral architectures).
base_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer = BlockOptimizer(
    base_optimizer=base_optimizer,
    named_parameters_list=list(model.named_parameters()),
    switch_block_every=100,    # iterations before the next block becomes active
    switch_mode="ascending",   # walk blocks in layer order ("random" is also supported)
)

# Standard training loop
for inputs, labels in dataloader:
    loss = model(input_ids=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
The Head-to-Head: Which Should You Choose?
1. Convergence Speed and Quality
In my experience, GaLore wins on convergence. Since GaLore updates all parameters in every step (via the projection), it behaves much more like standard AdamW. BAdam, by nature of being coordinate-descent-like, requires more iterations to reach the same loss levels. If you have the VRAM to support GaLore's rank requirements, use it.
2. Absolute Memory Minimums
BAdam wins on pure memory efficiency. If you are trying to fine-tune a 70B model on a single A100 (80GB) or a 7B model on a 12GB laptop GPU, GaLore might still fail because it needs to maintain the low-rank projection matrices for the whole model. BAdam’s ability to strictly limit the "active" parameters makes it the only choice for extreme edge cases.
3. Ease of Use
GaLore is easier to drop into existing pipelines (Hugging Face Trainer, etc.) because it's just an optimizer swap. BAdam requires more care in defining "blocks," and if your block boundaries are poorly chosen (e.g., splitting a residual block), training can become unstable.
Real-World Gotchas and Common Pitfalls
The GaLore "Rank" Trap
I see many developers setting the GaLore rank too low (e.g., 16 or 32) to save memory. While this works for LoRA, GaLore needs a higher rank (typically 128 or 256) to actually simulate full-parameter training. If your rank is too low, you aren't doing "Full-Parameter Fine-Tuning"; you're just doing a very expensive version of LoRA with higher overhead.
The BAdam "Block Stall"
When using BAdam, you must ensure your data shuffling is robust. Because you are only updating one block at a time, if the data in those specific iterations is biased, that specific block will "drift" away from the rest of the model. I recommend increasing your gradient accumulation steps when using BAdam to ensure each block update is based on a diverse enough sample of the data.
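Here is a minimal, self-contained illustration of that advice with a toy model and synthetic data (all names are mine): a larger `accum_steps` means each block update averages gradients over more, and more diverse, micro-batches.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 2)                      # stand-in for the active block
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(16)]

accum_steps = 8   # each block update now averages gradients over 8 micro-batches
for step, (x, y) in enumerate(batches):
    loss = F.cross_entropy(model(x), y)
    (loss / accum_steps).backward()                 # accumulate into .grad
    if (step + 1) % accum_steps == 0:
        opt.step()                                   # one averaged, less biased update
        opt.zero_grad()
```

With `accum_steps = 1` each block would instead drift toward whatever slice of the data it happened to see while active.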
Interaction with Quantization (NF4/4-bit)
Both GaLore and BAdam can be combined with 4-bit quantization (via bitsandbytes). However, be careful with GaLore's 8-bit optimizer. Combining 4-bit weights with 8-bit optimizer states and low-rank projections creates a lot of "approximation noise." I've found that using BF16 for weights and 8-bit for GaLore states is the "sweet spot" for stability.
Hardware Recommendations for 7B Models
If you're wondering what you need to run these today:
- Standard AdamW: 80GB VRAM (A100/H100).
- GaLore (Rank 128): 24GB VRAM (RTX 3090/4090). You can actually squeeze this into 16GB if you use 4-bit weights.
- BAdam: 12GB - 16GB VRAM (RTX 4070 Ti / 4080).
For those fine-tuning small language models for edge deployments, BAdam is almost certainly your best bet, as it lets you trade time for memory in a way that GaLore cannot.
Practical FAQ
Q: Can I use GaLore with DeepSpeed ZeRO-3? A: It is complicated. ZeRO-3 partitions parameters across GPUs, while GaLore expects to manage the gradient projection per-layer. Most current implementations of GaLore work best with ZeRO-2 or simple Data Parallelism. If you're scaling across a massive cluster, the benefits of GaLore diminish compared to what ZeRO-3 already provides.
Q: Does BAdam affect the final model's reasoning capabilities? A: In my testing, as long as you run enough epochs to ensure every block has been updated multiple times, the final weights are nearly indistinguishable from full-tuning. However, if you stop early, the "later" blocks in your BAdam sequence might be undertrained compared to the "earlier" blocks.
Q: Is there any reason to still use LoRA? A: Yes. LoRA is still faster in terms of wall-clock time because it has fewer parameters to compute gradients for. If you only need to change the style of the model (e.g., making it talk like a pirate), LoRA is sufficient. Use GaLore or BAdam when you need the model to learn new knowledge or complex logic that LoRA's low-rank adapters can't capture.
Next Steps
If you're ready to move beyond LoRA, I suggest starting with GaLore. It’s the more "natural" evolution of the AdamW optimizer and preserves the global training dynamics better than block-wise methods. Start with a 7B model on a single 24GB GPU, set your rank to 128, and compare the validation loss against your best LoRA run. You’ll likely find that GaLore reaches a lower loss floor, even if it takes 20% longer to train.
For those pushing the limits on consumer hardware, keep an eye on how these techniques evolve alongside work on optimizing MoE architectures for efficient inference, as the combination of Mixture-of-Experts and low-rank gradient projection is the likely future of local LLM development.

CyberInsist
Official blog of CyberInsist - Empowering you with technical excellence.