Beyond Jaccard: Why Your LLM Pre-training Pipeline Needs Semantic Deduplication

If you are training an LLM on two trillion tokens, and 10% of your dataset consists of near-duplicates or paraphrased redundancies, you are effectively burning millions of dollars in compute for zero marginal gain in model perplexity. In fact, it’s worse than zero: over-representing specific data clusters leads to "memorization" and "hallucination" rather than generalization. For years, MinHashLSH was the industry standard for cleaning C4 or Common Crawl. But as we move toward high-quality "curated" datasets, the limitations of syntax-based deduplication have become a bottleneck, leading to the rise of SemDeDup (Semantic Deduplication).
I’ve spent the last year optimizing data ingestion pipelines for multi-billion parameter models, and the choice between these two isn't just a matter of "better" or "worse"—it’s about where you sit on the spectrum of compute cost versus data quality. Let’s break down the mechanics, the performance tradeoffs, and the implementation realities of these two approaches.
Quick Summary
MinHashLSH is a syntactic deduplication method that relies on Jaccard similarity. It is incredibly fast, horizontally scalable via Spark or Ray, and excellent at catching exact or near-exact string matches. However, it is blind to meaning; it cannot tell that "The weather is sunny" and "It is a bright, clear day" are redundant.
SemDeDup leverages dense embeddings (e.g., from CLIP or RoBERTa) to identify clusters of semantically similar documents. It is significantly more computationally expensive but can reduce dataset size by an additional 10-30% compared to MinHash alone by removing paraphrased content and low-information density samples. In modern pipelines, these are often used in a tiered approach: MinHash to remove the "junk" at the TB scale, and SemDeDup to refine the remaining GB-scale high-quality clusters.
The Syntactic Workhorse: MinHashLSH
MinHashLSH operates on the principle of Locality Sensitive Hashing (LSH). To understand why it scales, you have to understand the dimensionality reduction it performs. Instead of comparing every document to every other document ($O(N^2)$), we hash documents such that similar documents hash into the same "buckets" with high probability.
How it works in production
- Shingling: You break text into n-grams (usually 5-grams or 13-grams).
- MinHashing: You apply hundreds of hash functions to the set of shingles and keep the minimum hash value for each. This signature represents the document.
- LSH Banding: You divide the signature into $b$ bands of $r$ rows. If two documents match in any single band, they are considered "candidate pairs."
The probability of two documents with Jaccard similarity $s$ being caught as candidates is $P = 1 - (1 - s^r)^b$. By tuning $b$ and $r$, you control the threshold.
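You can sanity-check a banding scheme directly from this formula before committing to a large job. Below is a standalone sketch; the b and r split is illustrative, and libraries like datasketch derive these values from your target threshold automatically:
def candidate_probability(s: float, b: int, r: int) -> float:
    """P that two docs with Jaccard similarity s collide in at least one of b bands of r rows."""
    return 1.0 - (1.0 - s ** r) ** b

b, r = 16, 8  # 16 bands * 8 rows = 128 permutations (illustrative split)
approx_threshold = (1.0 / b) ** (1.0 / r)  # similarity where the S-curve is steepest (~0.71 here)
print(f"Approximate threshold: {approx_threshold:.2f}")
for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"Jaccard {s:.1f} -> candidate probability {candidate_probability(s, b, r):.3f}")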
The Problem with MinHash
The glaring weakness is that MinHash is keyword-dependent. If you change the word order or use synonyms, the Jaccard similarity plummets. In a massive pre-training run, this means your model sees the same news story reported by fifty different outlets, each with slightly different wording, and treats every copy as unique information. This redundancy is particularly harmful when training small LLMs on limited or synthetic data, where every token must carry maximum signal.
The Semantic Surgeon: SemDeDup
SemDeDup represents a paradigm shift. Instead of looking at characters, it looks at the embedding space. Developed by researchers at Meta AI, SemDeDup identifies redundancies that are invisible to MinHash.
The algorithm follows a three-step process:
- Embedding Generation: Pass every document through a lightweight encoder (like a small transformer) to get a dense vector.
- K-Means Clustering: Cluster the entire embedding space into $K$ centroids. This allows for localized comparisons.
- Intra-cluster Deduplication: Within each cluster, compute pairwise cosine similarities. If two vectors are more similar than a threshold $\epsilon$, only one of the pair is kept; which one is a policy choice, with distance to the centroid and document length as common tie-breakers. A sketch of steps 2 and 3 follows this list.
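Here is a minimal sketch of steps 2 and 3 using faiss's built-in k-means on L2-normalized embeddings. The array shapes, cluster count K, and threshold are illustrative placeholders, and the random vectors stand in for real document embeddings:
import numpy as np
import faiss

# Toy stand-in: assume `embeddings` is an (N, d) float32 array of document vectors.
N, d = 10_000, 384
embeddings = np.random.rand(N, d).astype('float32')
faiss.normalize_L2(embeddings)  # so inner product == cosine similarity

# Step 2: k-means over the embedding space (K is illustrative, not tuned).
K = 100
kmeans = faiss.Kmeans(d, K, niter=20, spherical=True, seed=42)
kmeans.train(embeddings)
_, assignments = kmeans.index.search(embeddings, 1)  # nearest centroid per document
assignments = assignments.ravel()

# Step 3: pairwise cosine similarity inside each cluster only.
epsilon = 0.95
to_remove = set()
for c in range(K):
    members = np.where(assignments == c)[0]
    if len(members) < 2:
        continue
    sims = embeddings[members] @ embeddings[members].T  # cosine, since vectors are normalized
    dup_i, dup_j = np.where(np.triu(sims, k=1) > epsilon)
    for i, j in zip(dup_i, dup_j):
        to_remove.add(int(members[j]))  # naive policy: drop the second of each pair

print(f"Flagged {len(to_remove)} of {N} documents as semantic duplicates")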
Why SemDeDup is the New Gold Standard
SemDeDup doesn't just find duplicates; it finds semantic redundancy. It can identify that a Wikipedia summary of a movie and a fan-written summary of the same movie are redundant. More importantly, it helps when evaluating LLM-as-a-judge setups for domain-specific tasks by ensuring the evaluation set doesn't contain leaked or paraphrased training data, a common failure mode in LLM benchmarking.
Head-to-Head Comparison: Production Metrics
| Metric | MinHashLSH | SemDeDup |
|---|---|---|
| Computational Complexity | Low ($O(N)$ with hashing) | High ($O(N)$ for embeddings + $O(IKN)$ for clustering) |
| Memory Overhead | Low (integer signatures) | High (dense float32 vectors) |
| Precision (Syntactic) | Very High | High |
| Recall (Semantic) | Very Low | Very High |
| Hardware Requirement | CPU clusters (Spark) | GPU clusters (for embeddings/Faiss) |
| Scalability | Petabyte-scale | Terabyte-scale (limited by VRAM) |
Implementation Guide: MinHashLSH with Datasketch
If you're dealing with raw web crawls, start here. This Python snippet uses the datasketch library, which is the standard for LSH in Python.
from datasketch import MinHash, MinHashLSH
from nltk import ngrams
def get_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    # Character 5-gram shingles (nltk.ngrams over a string yields character tuples)
    shingles = set([''.join(g) for g in ngrams(text, 5)])
    for s in shingles:
        m.update(s.encode('utf8'))
    return m
# Initialize LSH with a 0.8 Jaccard threshold
lsh = MinHashLSH(threshold=0.8, num_perm=128)
# Simulating data ingestion
data = {
    "doc1": "The quick brown fox jumps over the lazy dog",
    "doc2": "The quick brown fox jumped over the lazy dog",  # Near duplicate
}
for doc_id, text in data.items():
    m = get_minhash(text)
    # Check for candidates before inserting
    result = lsh.query(m)
    if not result:
        lsh.insert(doc_id, m)
    else:
        print(f"Duplicate found: {doc_id} matches {result}")
Gotcha: In production, do not run this on a single machine. You should use a distributed framework like Ray. The num_perm parameter is your biggest lever. Increasing it improves precision but increases the size of your hash signatures linearly.
Implementation Guide: SemDeDup with Sentence-Transformers and Faiss
For semantic deduplication, we use faiss for efficient vector search and sentence-transformers for embeddings.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2') # Lightweight but effective
documents = ["The sun is a star.", "Our solar system's center is the sun."]
# 1. Generate Embeddings
embeddings = model.encode(documents).astype('float32')
# 2. Setup Faiss Index for Cosine Similarity
# IndexFlatIP uses Inner Product, which is equivalent to Cosine if vectors are normalized
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
# 3. Find Redundancies (Thresholding)
threshold = 0.95
D, I = index.search(embeddings, k=2) # k=2 because the closest is always itself
for i in range(len(embeddings)):
    # D[i][1] is the similarity to the second closest neighbor
    if D[i][1] > threshold:
        print(f"Document {i} is semantically redundant with {I[i][1]}")
Implementation Tip: For billion-scale vectors, use an IVF-PQ (Inverted File with Product Quantization) index in Faiss. This reduces the memory footprint of your embeddings by 10x-100x while maintaining search accuracy.
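A minimal sketch of building such an index; the nlist, m, and sample sizes below are illustrative placeholders rather than tuned values:
import numpy as np
import faiss

d = 384        # embedding dimension (matches all-MiniLM-L6-v2)
nlist = 4096   # number of coarse cells (inverted lists)
m = 48         # PQ sub-quantizers; must divide d evenly
nbits = 8      # bits per sub-quantizer code

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits, faiss.METRIC_INNER_PRODUCT)

# IVF-PQ must be trained on a representative sample before vectors can be added.
sample = np.random.rand(200_000, d).astype('float32')
faiss.normalize_L2(sample)
index.train(sample)
index.add(sample)

index.nprobe = 32  # number of cells scanned per query: the recall/latency knob
D, I = index.search(sample[:5], 2)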
The Hidden Cost of Semantic Deduplication
I have seen teams jump straight to SemDeDup because "semantic is better," only to realize their AWS bill has tripled. The bottleneck is not just the GPU inference for embeddings; it's the All-to-All comparison. Even with clustering, you are loading massive vector sets into memory.
If you are building a pipeline for fine-tuning an open-source LLM on a domain-specific RAG corpus, you might only have 100k documents. In that case, SemDeDup is a no-brainer. But if you are processing 100TB of Common Crawl, the strategy should be:
- Exact Match: Deduplicate by URL or MD5 hash (Cost: Nearly Zero). A minimal sketch of this stage follows this list.
- MinHashLSH: Deduplicate near-exact matches (Cost: Moderate).
- SemDeDup: Run on the "highest quality" subsets (e.g., academic papers, books) to ensure diverse representation (Cost: High).
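The exact-match stage is essentially free. Here is a minimal sketch using an MD5 content hash over lightly normalized text; the normalization and helper names are illustrative, and in production this check is streamed inside the ingestion job:
import hashlib

def content_key(text: str) -> str:
    """MD5 over whitespace-normalized, lowercased text: catches byte-identical and trivial variants."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

seen_keys = set()

def is_exact_duplicate(text: str) -> bool:
    key = content_key(text)
    if key in seen_keys:
        return True
    seen_keys.add(key)
    return False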
Common Pitfalls in Production
1. The "N-Gram" Trap in MinHash
Using 1-word shingles is useless; you'll match on common stop words. Using 20-word shingles is too restrictive; you'll miss everything but exact copies. For LLM pre-training, 5-gram to 9-gram character shingles are usually the sweet spot for balance.
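A quick, self-contained way to see the effect: compare single-word shingles against 5-character shingles on two unrelated sentences (the sentences are arbitrary examples):
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def char_shingles(text: str, n: int = 5) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

s1 = "the central bank raised interest rates again this quarter"
s2 = "the weather service issued another storm warning this week"

# Single-word shingles: unrelated sentences still overlap because of stop words.
print(jaccard(set(s1.split()), set(s2.split())))
# 5-character shingles: the overlap collapses toward zero, as it should for unrelated text.
print(jaccard(char_shingles(s1), char_shingles(s2)))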
2. Embedding Drift
When using SemDeDup, the model you use for embeddings matters. If you use a model trained on social media text to deduplicate legal contracts, the "semantic" clusters will be poorly defined. Always use an embedding model that has been pre-trained on a diverse corpus or one that matches your target domain.
3. The "Length Bias" in SemDeDup
Cosine similarity often favors shorter documents when they are contained within longer ones. In many SemDeDup implementations, the shorter document is discarded. However, for LLM training, shorter documents often lack the context found in longer ones. I recommend a "survival of the most informative" rule: if two documents are semantically identical, keep the one with the higher token count or the higher "knowledge density" (calculated by entropy).
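A rough sketch of that tie-breaking rule, using unigram token entropy as a crude stand-in for knowledge density (the whitespace tokenization and scoring here are deliberately simplistic):
import math
from collections import Counter

def token_entropy(text: str) -> float:
    """Shannon entropy of the unigram token distribution: a crude informativeness proxy."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def keep_most_informative(doc_a: str, doc_b: str) -> str:
    """Given a semantically duplicate pair, keep the longer, higher-entropy document."""
    score_a = (len(doc_a.split()), token_entropy(doc_a))
    score_b = (len(doc_b.split()), token_entropy(doc_b))
    return doc_a if score_a >= score_b else doc_b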
4. Ignoring the "Banding" Math
Many engineers set $b$ and $r$ in LSH based on a tutorial. If you don't plot the S-curve ($P = 1 - (1 - s^r)^b$), you will either over-filter (losing good data) or under-filter (keeping junk). Use a graphing tool to visualize the threshold before committing to a 100-node Spark job.
Scaling the Final Architecture
In a mature production pipeline, you don't choose one over the other. You build a multi-stage filtering funnel.
Stage 1 is a Bloom Filter or Redis-based check for exact string matches. Stage 2 is a distributed MinHash job running on Spark/Ray to clear out the bulk of the "boilerplate" web content. Stage 3 is where you get sophisticated: take your filtered documents, generate embeddings in batches, and use a GPU-accelerated library like Faiss or Raft to perform SemDeDup.
This tiered approach ensures that you aren't wasting expensive GPU cycles on documents that could have been caught by a simple hash. It also ensures that the final dataset—the one you're going to spend $500k training on—is as dense with information as possible.
Practical FAQ
Q: Can I use MinHashLSH for code deduplication?
A: Yes, but with a twist. For code, you should normalize the text first: remove comments, standardize indentation, and perhaps replace variable names with generic tokens (e.g., VAR1, VAR2). This makes MinHash much more effective at catching "copy-pasted" code with minor changes.
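As a rough illustration for Python source, the sketch below uses the standard-library tokenize module to drop comments and map identifiers to generic placeholders. Treat it as a starting point rather than a robust normalizer; production pipelines typically use a per-language parser:
import io
import keyword
import tokenize

def normalize_python(source: str) -> str:
    """Drop comments and map identifiers to generic tokens before MinHashing code."""
    var_map = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue  # strip comments and layout tokens
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Each distinct identifier becomes VAR1, VAR2, ... so renamed copies still collide.
            var_map.setdefault(tok.string, f"VAR{len(var_map) + 1}")
            out.append(var_map[tok.string])
        else:
            out.append(tok.string)
    return " ".join(out)

print(normalize_python("total = price * qty  # compute cost\n"))
# -> VAR1 = VAR2 * VAR3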
Q: What is the optimal number of clusters (K) for SemDeDup?
A: It depends on your dataset size. A common heuristic is $K = \sqrt{N/2}$, where $N$ is the number of documents. However, for billion-scale datasets, $K$ is usually limited by the number of centroids that fit in GPU memory during the k-means training phase. In practice, $K$ between 10,000 and 100,000 is sufficient for large-scale pre-training.
Q: Is there a way to do SemDeDup without an $O(N^2)$ comparison?
A: Yes, that's exactly what the clustering step does. By only comparing documents within the same cluster (and perhaps neighboring clusters), you shrink each comparison set from $N$ to roughly $N/K$ documents. This is why SemDeDup is feasible at scale while a naive all-pairs cosine similarity search is not.
Q: Does deduplication always improve model performance?
A: Not necessarily. If you deduplicate too aggressively, you might remove "common knowledge" that the model needs to see multiple times to learn the statistical distribution of language. The goal is to remove harmful redundancy (spam, boilerplate, scraped site headers) while keeping natural redundancy. Always validate your deduplication threshold against a small "proxy" model's training run first.