Record Linkage and Probabilistic Matching in Family Trees

Record linkage—also called entity resolution, deduplication, or matching—solves a fundamental genealogical problem: the same person appears across multiple documents with variations in name, date, location, and spelling. "William Smith," "Wm Smith," "William Smyth," and "W. Smith" might all be your ancestor, or they might be different people entirely. Genealogists traditionally solved this through painstaking manual comparison; AI brings probabilistic matching to automate the process.

Record linkage works by computing similarity between records across multiple dimensions simultaneously. A deterministic approach uses hard rules: "If first name and last name match exactly AND age within 2 years AND birthplace identical, it's a match." This works for clean data but fails with historical records containing spelling drift, age approximations, and transcription errors. Probabilistic matching is more sophisticated, assigning weights to different matching criteria and computing an overall probability that two records refer to the same person.
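The hard rule quoted above can be written down directly. The following is a minimal sketch, assuming each record is a dictionary with hypothetical keys `first`, `last`, `age`, and `birthplace`; it shows how a single spelling variant is enough to defeat a deterministic rule.

```python
def deterministic_match(a, b, max_age_gap=2):
    """Hard rule: exact names, age within max_age_gap years, identical birthplace.

    The record keys used here are illustrative assumptions, not a standard schema.
    """
    return (
        a["first"] == b["first"]
        and a["last"] == b["last"]
        and abs(a["age"] - b["age"]) <= max_age_gap
        and a["birthplace"] == b["birthplace"]
    )


census = {"first": "William", "last": "Smith", "age": 30, "birthplace": "Leeds"}
parish = {"first": "William", "last": "Smith", "age": 31, "birthplace": "Leeds"}
variant = {"first": "William", "last": "Smyth", "age": 31, "birthplace": "Leeds"}

clean_result = deterministic_match(census, parish)     # all four conditions hold
variant_result = deterministic_match(census, variant)  # fails on the surname spelling
```

The second comparison is rejected even though every other field agrees, which is the failure mode that motivates probabilistic matching.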

The technical foundation relies on Fellegi-Sunter theory, a probabilistic framework developed for census reconciliation. The model computes likelihood ratios comparing two hypotheses: H1 (the records are the same person) versus H2 (they are different people). For each data field (name, age, birthplace), it calculates P(data | H1)—the probability of observing that data if they are the same person—and P(data | H2). These probabilities come from training data showing how frequently exact matches, near matches, and mismatches occur among known duplicates and known distinct records.
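One simple way to obtain these per-field probabilities is to count outcomes in a labeled sample. The sketch below assumes a hypothetical list of `(agreement_level, is_same_person)` pairs for a single field; real systems often estimate the same quantities without labels, for example with expectation-maximization, so treat this as an illustration of the idea rather than a prescribed method.

```python
from collections import Counter


def estimate_field_probabilities(labeled_pairs):
    """Estimate P(level | same person) and P(level | different people) for one field.

    labeled_pairs: iterable of (agreement_level, is_same_person) tuples,
    e.g. ("exact", True). The level names are illustrative assumptions.
    Returns {level: (p_given_same, p_given_different)}.
    """
    same, different = Counter(), Counter()
    for level, is_same_person in labeled_pairs:
        (same if is_same_person else different)[level] += 1

    n_same, n_different = sum(same.values()), sum(different.values())
    if n_same == 0 or n_different == 0:
        raise ValueError("need at least one known duplicate and one known distinct pair")

    levels = set(same) | set(different)
    return {
        level: (same[level] / n_same, different[level] / n_different)
        for level in levels
    }


# Tiny made-up sample: 4 known duplicates and 5 known distinct pairs.
sample = (
    [("exact", True)] * 3 + [("mismatch", True)] * 1
    + [("exact", False)] * 1 + [("mismatch", False)] * 4
)
probabilities = estimate_field_probabilities(sample)
```

With this made-up sample, an exact agreement is seen in 3 of 4 duplicate pairs and 1 of 5 distinct pairs, so the estimates for "exact" are 0.75 and 0.2.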

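A simplified example: if an exact first-name match occurs 95% of the time among records for the same person but only 10% of the time among randomly paired records, that is strong evidence for a match (likelihood ratio of 0.95 / 0.10 = 9.5). A last-name mismatch might occur 5% of the time for the same person (due to transcription error or remarriage) but 80% of the time for different people. That gives a likelihood ratio of 0.05 / 0.80, about 0.06, which is substantial evidence against a match, though not conclusive on its own. The algorithm combines all such signals into a final score, typically by multiplying the ratios (or summing their logarithms) under the assumption that the fields are independent of one another.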
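The arithmetic in this example can be checked with a few lines of code. This is a minimal sketch that uses the percentages from the text, assumes the fields are conditionally independent (the usual simplification in Fellegi-Sunter style scoring), and introduces a hypothetical `prior` parameter to turn the combined ratio into a probability.

```python
import math

# (m, u) pairs taken from the example in the text:
# m = P(observation | same person), u = P(observation | different people)
OBSERVATIONS = {
    "first_name_exact_match": (0.95, 0.10),
    "last_name_mismatch": (0.05, 0.80),
}


def likelihood_ratio(m, u):
    """Ratio above 1 favors a match; ratio below 1 favors a non-match."""
    return m / u


def combined_log2_weight(observations):
    """Sum of log2 likelihood ratios, valid if fields are independent."""
    return sum(math.log2(likelihood_ratio(m, u)) for m, u in observations.values())


def posterior_probability(log2_weight, prior):
    """Convert a combined weight to a match probability given a prior.

    The prior (share of candidate pairs that are true matches) is an
    assumption the user must supply; it is not specified in the text.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** log2_weight
    return posterior_odds / (1 + posterior_odds)


weight = combined_log2_weight(OBSERVATIONS)
probability = posterior_probability(weight, prior=0.5)
```

The first-name ratio is 9.5 and the last-name ratio is 0.0625, so the combined ratio is below 1 and the two observations together lean slightly against a match before any other fields are considered.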

Genealogists encounter specific challenges that standard record linkage doesn't address. Women's names change with marriage; "Mary Johnson" becomes "Mary Smith" after marriage to John Smith. Standard matching would treat the surname change as evidence against a match. Genealogical systems need knowledge of these naming patterns and must recognize that a surname change dated after a marriage is fully consistent with the records describing the same woman, whereas a married surname appearing before the marriage is not. Some systems incorporate explicit handling for marriage name changes.
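One way such handling could look is a surname check that is aware of the marriage date. The sketch below is an assumption-heavy illustration: the profile keys (`maiden_surname`, `married_surname`, `marriage_year`) and the 1872 marriage year are hypothetical, and it ignores spelling variation, multiple marriages, and cultures with different naming conventions.

```python
def surname_consistent(profile, record_surname, record_year):
    """Return True if a record's surname fits the person's known name history.

    The maiden surname is accepted in any year (it can reappear in later
    records); the married surname is accepted only from the marriage year on.
    """
    if record_surname == profile["maiden_surname"]:
        return True
    married = profile.get("married_surname")
    marriage_year = profile.get("marriage_year")
    if married is None or marriage_year is None:
        return False
    return record_surname == married and record_year >= marriage_year


# Hypothetical profile for the example in the text; the year is invented.
mary = {
    "maiden_surname": "Johnson",
    "married_surname": "Smith",
    "marriage_year": 1872,
}

after_marriage = surname_consistent(mary, "Smith", 1880)   # consistent
before_marriage = surname_consistent(mary, "Smith", 1865)  # not consistent
```

In a probabilistic matcher, a result like this would more naturally adjust the surname field's weight than act as a hard filter.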

Blocking is a critical optimization. Comparing every record against every other record scales as O(n²): with 10 million historical records, that is roughly 50 trillion pairwise comparisons. Blocking pre-filters candidates by grouping records into blocks based on cheap criteria (same birthplace, same surname, same century), and the expensive probabilistic matching is performed only within each block. This can cut the comparison count by several orders of magnitude, making the process computationally feasible. The trade-off is that two records for the same person that land in different blocks are never compared, so blocking keys should tolerate the kinds of errors expected in the data.
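The effect of blocking is easy to see on a toy dataset. This is a minimal sketch, assuming records are dictionaries and the blocking key is a user-supplied function; the field names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def total_pairs(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def candidate_pairs(records, key):
    """Yield only pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


records = [
    {"id": 1, "surname": "Smith", "birthplace": "Leeds"},
    {"id": 2, "surname": "Smyth", "birthplace": "Leeds"},
    {"id": 3, "surname": "Jones", "birthplace": "Leeds"},
    {"id": 4, "surname": "Smith", "birthplace": "York"},
]

all_pairs = total_pairs(len(records))  # 6 pairs without blocking
blocked = list(candidate_pairs(records, key=lambda r: r["birthplace"]))  # 3 pairs
full_scale = total_pairs(10_000_000)  # 49,999,995,000,000
```

Blocking on birthplace keeps the Smith/Smyth pair from Leeds as a candidate, while the Smith record from York is never compared with anyone. Blocking on exact surname instead would drop the Smith/Smyth pair, which illustrates why the choice of key matters.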

The choice of string similarity metric has a large effect on match quality. Levenshtein distance (edit distance) counts the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another. "Smith" vs. "Smyth" has distance 1 (one substitution). Phonetic algorithms such as Soundex and Metaphone map similar-sounding names to the same code, which handles spelling variants that sound alike. For genealogy, a hybrid approach often works well: compute multiple similarity measures and combine them in a learned model.
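Both measures are short enough to implement directly. The following is a sketch of the standard dynamic-programming Levenshtein distance and of American Soundex as it is commonly described; it is meant for experimentation, and a maintained library would be a better choice for production use.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


_SOUNDEX_CODES = {
    letter: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for letter in letters
}


def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous_digit = _SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate consonants with the same code
        digit = _SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous_digit:
            code += digit
        previous_digit = digit  # a vowel resets this, so repeats get coded again
    return (code + "000")[:4]


distance = levenshtein("Smith", "Smyth")        # 1
same_code = soundex("Smith") == soundex("Smyth")  # both are S530
```

Note that the two measures can disagree: "Smith" and "Smyth" are close on both, but names that sound alike with very different spellings will share a phonetic code while having a large edit distance.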

A critical architectural consideration: temporal consistency. If "William Smith b. 1850" in document A is matched to "William Smith b. 1847" in document B, the three-year difference in birth year must be explainable by estimation error or misrecording. A rule-based check might accept discrepancies of up to 5 years as plausible but flag discrepancies of 20 years. Genealogical systems layer temporal logic on top of probabilistic matching to enforce domain constraints.
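A birth-year check of this kind can sit alongside the probabilistic score. The sketch below uses the 5-year and 20-year figures from the text as defaults; the three labels and the treatment of the range in between are assumptions made for illustration.

```python
def birth_year_check(year_a, year_b, plausible_gap=5, implausible_gap=20):
    """Classify the gap between two recorded birth years.

    Gaps up to plausible_gap are treated as ordinary recording error,
    gaps of implausible_gap or more are flagged, and anything in between
    is left for closer inspection (an assumption, not from the text).
    """
    gap = abs(year_a - year_b)
    if gap <= plausible_gap:
        return "plausible"
    if gap >= implausible_gap:
        return "implausible"
    return "review"


william = birth_year_check(1850, 1847)  # gap of 3 years: "plausible"
suspect = birth_year_check(1850, 1825)  # gap of 25 years: "implausible"
```

A fuller version would also check the ordering of life events, but the same pattern of explicit, adjustable tolerances applies.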

Thresholding is a judgment call. A single cutoff such as 0.85 might be too conservative (rejecting true matches) or too liberal (accepting false matches), depending on the data. Rather than one hard threshold, production systems often use two: pairs scoring below a lower threshold (say 0.2) are treated as non-matches, pairs scoring above an upper threshold (say 0.8) are treated as matches, and pairs in between are marked uncertain and flagged for human review. This distributes work efficiently: the algorithm decides only the clear-cut cases automatically, and reviewers spend their time on the ambiguous ones.
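The three-way split can be expressed as a small function. This is a minimal sketch using the 0.2 and 0.8 cutoffs from the text as defaults; how the exact boundary values are assigned is an assumption, and in practice both thresholds would be tuned against reviewed data.

```python
def classify_pair(probability, lower=0.2, upper=0.8):
    """Map a match probability to 'non-match', 'uncertain', or 'match'.

    Boundary handling (lower is inclusive for 'uncertain', upper is
    inclusive for 'match') is an illustrative choice.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < lower:
        return "non-match"
    if probability >= upper:
        return "match"
    return "uncertain"


scores = {"pair_a": 0.05, "pair_b": 0.55, "pair_c": 0.93}
decisions = {name: classify_pair(p) for name, p in scores.items()}
review_queue = [name for name, d in decisions.items() if d == "uncertain"]
```

Here only `pair_b` lands in the review queue; widening the gap between the two thresholds sends more pairs to human reviewers, and narrowing it automates more decisions at the cost of more errors.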

Try this: Take three variants of your own name or a family member's name (e.g., "John Smith," "Jon Smyth," "J. Smith"). Use an online string similarity calculator to compute the Levenshtein distance for each pair and the Soundex code for each name. Then manually estimate the probability that each pair represents the same person. Compare your intuition to the algorithms' outputs. This reveals which similarity metrics align with genealogical reasoning and which miss important nuances in your specific case.
