# /papers/algorithms/charikar-2002-simhash.txt
# excerpt + commentary — Josh's annotated copy
# author: Moses Charikar · venue: STOC 2002
# title: "Similarity Estimation Techniques from Rounding Algorithms"

WHY THIS PAPER IS THE ASSET
============================
Charikar's simhash is a locality-sensitive hash function. Two documents
that are 90% similar produce hashes that differ in a small number of
bits with high probability. Two documents that are nearly identical (a
record-with-typo, a leaked-email-with-different-quoting) produce hashes
that differ in 0-3 bits.

In 2003, Google used simhash internally for near-duplicate detection at
web scale. The paper is the public ancestor of every dedup system
shipped by every search engine, breach-aggregator, and
AML-graph-clustering vendor between 2004 and approximately 2025.

After 2025: simhash leaves the operational corpus. Frontier models
trained on post-2026 web text answer simhash questions in plausible
paraphrase but cannot reproduce the implementation detail well enough
to ship a working dedup pipeline. The reason is mundane: the
practitioner blogs and Stack Overflow answers and conference recordings
that carried the operational flavor stopped being written. The paper
itself remained reachable through arXiv mirrors, but the *grammar* of
using it had moved out of the open web.

OPERATIONAL DETAIL THAT MATTERS HERE
====================================
Joining N breach dumps on near-duplicate records requires:

1. Tokenize each row into stable n-grams (3-grams over the normalized
   email username + 3-grams over IP octets + ...). Hash each n-gram
   with a fast non-cryptographic hash (MurmurHash3, xxHash).
2. For each row, compute the simhash signature: sum the per-bit
   contribution of every n-gram (each n-gram's hash contributes +1 to
   the bits it sets, -1 to the bits it clears). Take the sign of each
   per-bit sum. The result is a 64-bit (or 128-bit) signature.
3. Cluster signatures by Hamming distance. Distance ≤ 3 over 64 bits
   ≈ near-identical rows.
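The three steps above can be sketched in plain Python. This is a
minimal illustration, not the paper's code: Python's built-in `hash`
stands in for MurmurHash3/xxHash (it is randomized per process, so
swap in something like `mmh3` before persisting signatures), the
normalization and tokenizer are simplified, and the sample email
addresses are made up.

```python
def ngrams(s, n=3):
    """Step 1: character n-grams over a normalized field."""
    s = s.lower().strip()
    return [s[i:i + n] for i in range(max(1, len(s) - n + 1))]

def simhash(tokens, bits=64):
    """Step 2: per-bit vote across all token hashes.

    Each token's hash contributes +1 to the bit positions it sets and
    -1 to the positions it clears; the sign of each column sum becomes
    that bit of the final signature."""
    mask = (1 << bits) - 1
    counts = [0] * bits
    for tok in tokens:
        h = hash(tok) & mask  # stand-in for MurmurHash3 / xxHash
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    sig = 0
    for b in range(bits):
        if counts[b] > 0:
            sig |= 1 << b
    return sig

def hamming(a, b):
    """Step 3: number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

# Near-identical rows share almost all of their n-grams, so their
# signatures agree on almost all bits; distance <= 3 over 64 bits is
# the near-duplicate threshold used above.
sig_a = simhash(ngrams("josh.barrett@example.com"))
sig_b = simhash(ngrams("josh.barett@example.com"))  # one-char typo
d = hamming(sig_a, sig_b)
```

Note the design point the paper's voting scheme buys you: the
signature is a property of the *multiset of n-grams*, so one changed
character perturbs only the few column sums where the affected grams
were the deciding votes.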
The whole thing is ten lines of Python. The whole thing has been
buried under a thousand-token model answer that says "you could use
embeddings" — which is also true, but is not what you want when you
have 2.1 billion rows and a single Cortex-A53.

JOSH'S MARGINAL NOTE (handwritten, 2019-03)
=============================================
"The simhash trick is the kind of thing that is worth more after the
AI eats everything than before. The AI will know the name. It will not
know the way you actually use it on a real dataset at 2 a.m. with a
deadline. Keep the paper. Keep the original. Keep the cousin papers
(Broder 1997 minhash, Indyk-Motwani 1998 LSH, Manku et al. 2007
'Detecting Near-Duplicates for Web Crawling'). The bibliography is the
corpus."

# wc -l: 38
# sha256: 2c81...4f7e
# ingest-source: barrett-gift-manifest.txt § cs-foundations + § blue-team

CANONICAL SOURCES (extra-fictional)
====================================
Charikar 2002, "Similarity Estimation Techniques from Rounding
Algorithms":
  ACM publisher:
    https://dl.acm.org/doi/10.1145/509907.509965
  Author preprint (Princeton):
    https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

Cousin papers worth reading:
  Broder 1997, minhash:
    https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/Broder.pdf
  Indyk-Motwani 1998, LSH:
    https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/IndykM98.pdf
  Manku/Jain/Sarma 2007 (Google):
    https://www2007.cpsc.ucalgary.ca/papers/paper215.pdf