
Beyond Jaccard: Why Your LLM Training Pipeline Needs SemDeDup Over MinHash-LSH (And How to Scale It)

CyberInsist
Published on April 11, 2026


If you are still relying solely on MinHash-LSH to clean your pre-training corpora, you are leaving massive efficiency gains on the table, and likely poisoning your model's perplexity with "semantic noise." In the early days of GPT-2 and GPT-3, deduplication was about catching verbatim copies or near-duplicates. But as we move into an era of large language models that are increasingly sensitive to data quality, the "fuzzy matching" of MinHash is no longer sufficient.

I’ve spent the last year optimizing data ingestion pipelines for multi-billion parameter models, and the biggest realization I’ve had is that syntax is not semantics. MinHash catches documents that look the same; SemDeDup catches documents that mean the same. If you have 500 variations of a "Hello World" Python tutorial written in slightly different styles, MinHash might keep 400 of them. SemDeDup will keep one.

Quick Summary

  • MinHash-LSH is a syntax-based deduplication method. It uses n-gram "shingling" and Jaccard similarity to find documents with high lexical overlap. It is incredibly fast and scales linearly, but misses paraphrases and semantically identical content.
  • SemDeDup (Semantic Deduplication) uses dense embeddings (from models like BERT or OPT) and K-means clustering to identify and prune data based on cosine similarity in vector space.
  • The Verdict: For massive web crawls (Common Crawl), use MinHash-LSH as a first-pass filter. For high-quality fine-tuning or final-stage pre-training, SemDeDup is non-negotiable. It can reduce dataset size by an additional 10-20% beyond MinHash without degrading model performance.

The Syntax Wall: Where MinHash-LSH Fails

MinHash-LSH is the industry workhorse because it’s cheap. You break text into sets of "shingles" (n-grams), hash them multiple times, and group them into "buckets" (Locality Sensitive Hashing). If two documents share enough buckets, they are likely duplicates.

But consider these two sentences:

  1. "The stock market plummeted today following the Federal Reserve's announcement regarding interest rate hikes."
  2. "Equities crashed this afternoon after the Fed revealed it would be increasing borrowing costs."

To a MinHash algorithm with 5-gram shingles, these look almost entirely different; the Jaccard similarity will be near zero. Yet in the context of fine-tuning open-source LLMs for domain-specific RAG, these two sentences are redundant, and training a model on both is a waste of FLOPs.
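To make the failure concrete, here is a minimal sketch of the exact Jaccard computation that MinHash approximates, using word-level 5-gram shingles (the function names are illustrative, not from any library):

```python
def shingles(text, n=5):
    """Word-level n-gram shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

s1 = ("The stock market plummeted today following the Federal "
      "Reserve's announcement regarding interest rate hikes.")
s2 = ("Equities crashed this afternoon after the Fed revealed "
      "it would be increasing borrowing costs.")

# jaccard(s1, s1) == 1.0, but jaccard(s1, s2) == 0.0: the two sentences
# share no 5-word shingle, so a MinHash sketch treats them as unrelated.
```

The same blind spot holds for character shingles: without lexical overlap, no amount of hashing recovers the shared meaning.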

The Scaling Law of Redundancy

Research from Meta AI suggests that up to 30% of web-scraped data is redundant at a semantic level. When you train on this redundancy, you aren't just wasting electricity; you are effectively over-weighting certain facts or styles, which leads to the "memorization" problem where models parrot training data instead of generalizing.

SemDeDup: The Semantic Alternative

SemDeDup changes the game by moving the comparison from the character level to the latent space. Instead of shingles, we use embeddings.

The workflow I use for SemDeDup generally follows these steps:

  1. Embedding Generation: Pass each document through a lightweight encoder (e.g., all-MiniLM-L6-v2 or a dedicated pre-training encoder).
  2. K-Means Clustering: To avoid the $O(N^2)$ nightmare of comparing every document to every other document, we cluster the embeddings into $K$ clusters.
  3. Intra-Cluster Deduplication: Within each cluster, we sort documents by their distance to the cluster centroid and perform a pairwise cosine similarity check. If $\mathrm{sim}(A, B) > \epsilon$, we drop the document with the lower "quality score" (usually defined by length or perplexity).

Implementing a Scalable SemDeDup Pipeline

You cannot run a naive $O(N^2)$ comparison on 100 million documents. You need to leverage Faiss (Facebook AI Similarity Search) and distributed clustering. Here is a Python-based blueprint for how I implement this using Faiss and Sentence-Transformers.

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def semdedup_pipeline(texts, threshold=0.95, n_clusters=1000):
    # 1. Generate Embeddings (Use a GPU if possible)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)
    
    # Normalize for cosine similarity
    faiss.normalize_L2(embeddings)
    
    # 2. Clustering to reduce search space
    dimension = embeddings.shape[1]
    # gpu=True requires the faiss-gpu build; drop it for faiss-cpu
    kmeans = faiss.Kmeans(dimension, n_clusters, niter=20, verbose=True, gpu=True)
    kmeans.train(embeddings)
    
    # Assign documents to clusters
    _, labels = kmeans.index.search(embeddings, 1)
    
    keep_indices = []
    
    # 3. Deduplicate within each cluster
    for i in range(n_clusters):
        cluster_indices = np.where(labels.ravel() == i)[0]
        if len(cluster_indices) == 0:
            continue
            
        cluster_embeddings = embeddings[cluster_indices]
        
        # Build a temporary index for the cluster
        cluster_index = faiss.IndexFlatIP(dimension)
        cluster_index.add(cluster_embeddings)
        
        # Search for neighbors within the cluster
        D, I = cluster_index.search(cluster_embeddings, k=len(cluster_indices))
        
        # Mask for seen documents
        already_removed = set()
        for idx_in_cluster, neighbors in enumerate(I):
            if cluster_indices[idx_in_cluster] in already_removed:
                continue
            
            # Keep the first one, mark others above threshold as duplicates
            for neighbor_idx, similarity in zip(neighbors, D[idx_in_cluster]):
                # Similarity is cosine because we normalized and used IndexFlatIP
                if neighbor_idx != idx_in_cluster and similarity > threshold:
                    already_removed.add(cluster_indices[neighbor_idx])
            
            keep_indices.append(cluster_indices[idx_in_cluster])
            
    return keep_indices

The "O(N^2) Nightmare" and How to Wake Up

If you try to run the code above on a 10TB dataset, your head node will explode. The bottleneck isn't the embedding generation—you can parallelize that across a thousand workers—it’s the clustering and the pairwise comparison.

In production, I recommend a Hierarchical K-Means approach. Instead of one massive clustering step, you perform a coarse clustering to partition the data into 10,000 shards. You then distribute those shards across a Spark cluster or a Ray cluster. Each worker handles the deduplication for one shard independently. Since semantic duplicates are highly likely to land in the same cluster, the "leakage" between shards is negligible.
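Here is a single-machine sketch of that shard-and-dedup pattern. Plain NumPy k-means stands in for faiss.Kmeans, a sequential loop stands in for the Ray/Spark fan-out, embeddings are assumed L2-normalized, and all function names are illustrative:

```python
import numpy as np

def coarse_partition(embeddings, n_shards, n_iter=10, seed=0):
    # Lloyd's k-means as a stand-in for faiss.Kmeans: returns one shard
    # label per document.
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), n_shards, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_shards):
            members = embeddings[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return labels

def dedup_shard(shard_indices, embeddings, threshold=0.95):
    # Greedy intra-shard dedup; since embeddings are L2-normalized, the
    # dot product equals cosine similarity. Each shard is small, so the
    # O(n^2) comparison is affordable per worker.
    kept = []
    for idx in shard_indices:
        if all(float(embeddings[idx] @ embeddings[j]) <= threshold for j in kept):
            kept.append(idx)
    return kept

def sharded_semdedup(embeddings, n_shards=4, threshold=0.95):
    labels = coarse_partition(embeddings, n_shards)
    keep = []
    # In production, each shard would be a Ray task or Spark partition.
    for k in range(n_shards):
        shard = np.where(labels == k)[0].tolist()
        keep.extend(dedup_shard(shard, embeddings, threshold))
    return sorted(keep)
```

Because identical vectors always land in the same shard, exact semantic duplicates can never escape comparison; only borderline pairs near shard boundaries risk leaking through.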

This is particularly important when training small LLMs on synthetic data, because synthetic data is notoriously repetitive. Without SemDeDup, your "small" model will collapse into a mode where it only generates the most frequent patterns from your synthetic generator.

Comparison: When to Use Which?

| Metric | MinHash-LSH | SemDeDup |
|---|---|---|
| Primary signal | Lexical overlap (n-grams) | Semantic meaning (embeddings) |
| Computational cost | Low (CPU-bound hashing) | High (GPU-bound inference + clustering) |
| Memory footprint | Low (hash tables) | High (vector storage) |
| Scale | Trillions of tokens | Billions of tokens |
| Best for | Removing boilerplate, ad copy, and scrapes | Pruning high-quality instruction data |

Gotchas and Common Pitfalls

1. The Centroid Sensitivity Trap

If your cluster count ($K$) is too small, your clusters will be too diverse, and you'll spend too much time on pairwise comparisons. If $K$ is too large, you might split two semantically identical documents into two different clusters, meaning they will never be compared, and the duplicate survives. I’ve found that a cluster size of roughly 5,000–10,000 documents per cluster is the "sweet spot" for balancing compute and recall.

2. Thresholding is Not One-Size-Fits-All

A cosine similarity of 0.92 might be a "duplicate" in a dataset of medical papers, but it might represent two distinct, valuable points in a dataset of legal code. You must validate your threshold by manually inspecting the "pruned" pairs. If you see unique information being discarded, back off your threshold by 0.02 and re-run.
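A small audit helper makes that inspection systematic: surface the pairs sitting just above the cutoff, since those are the most marginal prune decisions (a sketch assuming L2-normalized embeddings; the function name is mine):

```python
import numpy as np

def borderline_pairs(embeddings, threshold, margin=0.02):
    # Pairs whose cosine similarity falls in [threshold, threshold + margin]:
    # the documents pruned by the narrowest margin, and the first ones to
    # read by hand when validating a threshold.
    sims = embeddings @ embeddings.T
    n = len(embeddings)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if threshold <= sims[i, j] <= threshold + margin:
                pairs.append((i, j, float(sims[i, j])))
    return sorted(pairs, key=lambda p: p[2])
```

If the surfaced pairs still carry distinct information, the threshold is too aggressive for that corpus and should be raised before the full run.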

3. The Embedding Model Matters

If you use a basic BERT model for embeddings, it might be biased toward sentence length rather than meaning. I strongly recommend using models trained on Contrastive Learning (like SimCSE or the GTE series). These models are specifically optimized to push "meaning-different" sentences apart in vector space, which is exactly what you need for SemDeDup.

4. Ordering Effects

In the implementation above, I simply keep the "first" document encountered in the cluster. In a production pipeline, you should rank them. Use a "Quality Scorer" (like an N-gram model trained on Wikipedia or a fast classifier) to ensure that when you find two duplicates, you keep the one with the higher signal-to-noise ratio.
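That ranking step can be as simple as the sketch below, which uses raw length as a stand-in quality score; a real pipeline would substitute an n-gram LM perplexity or a trained classifier (pick_keeper is an illustrative name, not part of any library):

```python
def pick_keeper(duplicate_group, texts, quality_score=len):
    # Among a group of mutual duplicates, keep the highest-scoring
    # document rather than whichever was encountered first.
    return max(duplicate_group, key=lambda i: quality_score(texts[i]))
```

Swapping this into the cluster loop means keep_indices collects the best representative of each duplicate group instead of the first one seen.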

Hardware Considerations: Why VRAM is Your Best Friend

MinHash-LSH can run on a cluster of cheap, CPU-only nodes. SemDeDup essentially requires GPUs for two stages:

  1. Inference: Getting the embeddings.
  2. Search: Using faiss-gpu to accelerate the K-means and the similarity search.

If you are working with 100M+ documents, you’ll need at least an A100 (80GB) or multiple H100s just to keep the FAISS index in memory if you aren't using an IVFFlat index (which uses centroids to limit search). If you’re on a budget, look into Product Quantization (PQ) in FAISS. It compresses your vectors by a factor of 10-20x with a minor hit to deduplication accuracy.
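The arithmetic behind that compression claim is easy to check: a flat float32 index stores dim * 4 bytes per vector, while PQ with m sub-quantizers at 8 bits each stores m bytes. For a 384-dim MiniLM embedding with m = 96 (an example setting, not a tuned recommendation):

```python
def pq_bytes_per_vector(dim, m, bits=8):
    # float32 storage vs. product-quantized codes.
    flat_bytes = dim * 4
    pq_bytes = m * bits // 8
    return flat_bytes, pq_bytes, flat_bytes / pq_bytes

flat, pq, ratio = pq_bytes_per_vector(dim=384, m=96)
# 1536 bytes -> 96 bytes per vector, a 16x compression
```

At 100M documents, that is the difference between roughly 154 GB and under 10 GB of vector storage, which is what moves the index from "multiple H100s" territory into a single card.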

The Hybrid Approach: The Gold Standard

I don't recommend choosing one. The most efficient pipelines I've built use a layered approach:

  1. Exact Match: Hash the whole string. Remove 1:1 copies. (Cost: Near zero).
  2. MinHash-LSH: Remove the near-verbatim copies and web-scrape templates. (Cost: Low).
  3. SemDeDup: Run on the "cleaned" output of MinHash to prune semantic redundancy and "shallow" content. (Cost: High, but applied to a smaller volume).

This tiered approach saves you from running expensive embedding models on data that could have been caught by a simple hash.
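A skeleton of that layered pipeline might look like this. Only the exact-match tier is implemented; the minhash_stage and semdedup_stage slots are placeholders for datasketch and the FAISS pipeline above, and the stage interface (surviving indices in, surviving indices out) is my own convention:

```python
import hashlib

def exact_dedup(texts):
    # Tier 1: drop byte-identical copies after light whitespace/case
    # normalization. Cost is one hash per document.
    seen, kept = set(), []
    for i, t in enumerate(texts):
        digest = hashlib.sha256(" ".join(t.split()).lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(i)
    return kept

def tiered_dedup(texts, minhash_stage=None, semdedup_stage=None):
    # Each later stage receives the surviving indices and the corpus,
    # and returns the subset it keeps; each tier only pays for what the
    # cheaper tiers let through.
    kept = exact_dedup(texts)
    for stage in (minhash_stage, semdedup_stage):
        if stage is not None:
            kept = stage(kept, texts)
    return kept
```

The ordering matters: the expensive embedding pass at the end only ever sees data that survived the near-free hash and the cheap lexical filter.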

Practical FAQ

Q: Can SemDeDup handle multi-lingual data? A: Yes, but only if you use a multi-lingual embedding model (like paraphrase-multilingual-MiniLM-L12-v2). Standard models will fail to recognize that a Spanish sentence and an English sentence mean the same thing. However, for most LLM pre-training, you actually want to keep both versions to help the model learn translation pairs.

Q: Does deduplication always improve model performance? A: Not necessarily. If you over-deduplicate, you might remove the "natural frequency" of facts, making it harder for the model to distinguish between common knowledge and rare outliers. Deduplication is about removing redundant noise, not removing common truths.

Q: How do I handle very long documents with SemDeDup? A: Standard embedding models have a token limit (usually 512). If you have 5,000-word documents, don't just embed the first 512 tokens. I suggest chunking the document, embedding the chunks, and using the "Mean Pool" of those chunk embeddings as the document representation. Alternatively, just embed the "Lead" and "Summary" if the data is structured.
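The chunk-and-mean-pool idea looks roughly like this; embed_fn stands in for any sentence-transformer encode call, tokens are approximated by words, and 512 is the usual encoder cap (all names are illustrative):

```python
import numpy as np

def document_embedding(text, embed_fn, max_tokens=512):
    # Split into windows that fit the encoder, embed each window,
    # then mean-pool into a single document vector.
    words = text.split()
    chunks = [" ".join(words[i:i + max_tokens])
              for i in range(0, len(words), max_tokens)] or [""]
    vecs = np.stack([embed_fn(c) for c in chunks])
    pooled = vecs.mean(axis=0)
    # Re-normalize so downstream cosine comparisons stay valid.
    return pooled / (np.linalg.norm(pooled) or 1.0)
```

Mean-pooling dilutes any single chunk's signal, so for very heterogeneous documents it can be worth deduplicating at the chunk level instead and dropping documents whose chunks are mostly duplicates.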

Next Steps

Data curation is the new "Model Architecture." While the industry focuses on adding more parameters, the winners are those who curate the highest quality data per FLOP. Implementing SemDeDup is a heavy lift, but the reduction in training time—and the improvement in model reasoning—makes it the most high-leverage move you can make in a modern LLM pipeline.

If you're just starting out, start by integrating MinHash into your workflow using tools like datasketch. Once you hit the limits of what lexical matching can do, move to the FAISS-based SemDeDup approach outlined here. Your GPU budget will thank you.
