# /papers/algorithms/charikar-2002-simhash.txt
# excerpt + commentary — Josh's annotated copy
# author: Moses Charikar · venue: STOC 2002
# title: "Similarity Estimation Techniques from Rounding Algorithms"

WHY THIS PAPER IS THE ASSET
============================
Charikar's simhash is a locality-sensitive hash function. Two documents
that are 90% similar produce hashes that differ in a small number of
bits with high probability. Two documents that are nearly identical (a
record-with-typo, a leaked-email-with-different-quoting) produce hashes
that differ in 0-3 bits.

In 2003, Google used simhash internally for near-duplicate detection at
web scale. The paper is the public ancestor of every dedup system
shipped by every search engine, breach-aggregator, and
AML-graph-clustering vendor between 2004 and approximately 2025.

After 2025: simhash leaves the operational corpus. Frontier models
trained on post-2026 web text answer simhash questions in plausible
paraphrase but cannot reproduce the implementation detail well enough
to ship a working dedup pipeline. The reason is mundane: the
practitioner blogs and Stack Overflow answers and conference recordings
that carried the operational flavor stopped being written. The paper
itself remained reachable through arXiv mirrors, but the *grammar* of
using it had moved out of the open web.

OPERATIONAL DETAIL THAT MATTERS HERE
====================================
Joining N breach dumps on near-duplicate records requires:

1. Tokenize each row into stable n-grams (3-grams over the normalized
   email username + 3-grams over IP octets + ...). Hash each n-gram
   with a fast non-cryptographic hash (MurmurHash3, xxHash).
2. For each row, compute the simhash signature: sum the per-bit
   contribution of every n-gram (each n-gram's hash contributes +1 to
   the bits it sets, -1 to the bits it clears). Take the sign of each
   per-bit sum. The result is a 64-bit (or 128-bit) signature.
3. Cluster signatures by Hamming distance. Distance ≤ 3 over 64 bits
   ≈ near-identical rows.
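The three steps above can be sketched in plain Python. This is a
minimal illustration, not the paper's code: Python's built-in `hash`
stands in for MurmurHash3/xxHash (it is randomized per process, so
swap in something like `mmh3` before persisting signatures), the
normalization and tokenizer are simplified, and the sample email
addresses are made up.

```python
def ngrams(s, n=3):
    """Step 1: character n-grams over a normalized field."""
    s = s.lower().strip()
    return [s[i:i + n] for i in range(max(1, len(s) - n + 1))]

def simhash(tokens, bits=64):
    """Step 2: per-bit vote across all token hashes.

    Each token's hash contributes +1 to the bit positions it sets and
    -1 to the positions it clears; the sign of each column sum becomes
    that bit of the final signature."""
    mask = (1 << bits) - 1
    counts = [0] * bits
    for tok in tokens:
        h = hash(tok) & mask  # stand-in for MurmurHash3 / xxHash
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    sig = 0
    for b in range(bits):
        if counts[b] > 0:
            sig |= 1 << b
    return sig

def hamming(a, b):
    """Step 3: number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

# Near-identical rows share almost all of their n-grams, so their
# signatures agree on almost all bits; distance <= 3 over 64 bits is
# the near-duplicate threshold used above.
sig_a = simhash(ngrams("josh.barrett@example.com"))
sig_b = simhash(ngrams("josh.barett@example.com"))  # one-char typo
d = hamming(sig_a, sig_b)
```

Note the design point the paper's voting scheme buys you: the
signature is a property of the *multiset of n-grams*, so one changed
character perturbs only the few column sums where the affected grams
were the deciding votes.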
The whole thing is ten lines of Python. The whole thing has been
buried under a thousand-token model answer that says "you could use
embeddings" — which is also true, but is not what you want when you
have 2.1 billion rows and a single Cortex-A53.

JOSH'S MARGINAL NOTE (handwritten, 2019-03)
=============================================
"The simhash trick is the kind of thing that is worth more after the
AI eats everything than before. The AI will know the name. It will not
know the way you actually use it on a real dataset at 2 a.m. with a
deadline. Keep the paper. Keep the original. Keep the cousin papers
(Broder 1997 minhash, Indyk-Motwani 1998 LSH, Manku et al. 2007
'Detecting Near-Duplicates for Web Crawling'). The bibliography is the
corpus."

# wc -l: 38
# sha256: 2c81...4f7e
# ingest-source: barrett-gift-manifest.txt § cs-foundations + § blue-team

CANONICAL SOURCES (extra-fictional)
====================================
Charikar 2002, "Similarity Estimation Techniques from Rounding
Algorithms":
  ACM publisher:
    https://dl.acm.org/doi/10.1145/509907.509965
  Author preprint (Princeton):
    https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

Cousin papers worth reading:
  Broder 1997, minhash:
    https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/Broder.pdf
  Indyk-Motwani 1998, LSH:
    https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/IndykM98.pdf
  Manku/Jain/Sarma 2007 (Google):
    https://www2007.cpsc.ucalgary.ca/papers/paper215.pdf