Periagoge

Semantic Search: Finding What Documents Mean, Not Just What They Say

Semantic search matches on the meaning behind words rather than the words themselves. A search for "blacksmith" can find documents about metalworkers even when they use different terminology, and a search for "emigrated" can surface ship manifests, letters, and church records that mention a departure. This ability to find what documents mean, not just what they literally say, is especially valuable in historical research, where language and terminology have shifted.

Hypatia
Why It Matters

Traditional genealogy search is keyword-driven: you search for "John Smith 1850" and the database returns only records matching those exact terms. Semantic search works differently, matching on meaning and context. A search for "my grandfather's siblings" doesn't just look for the word "sibling"; it can surface records that name his brothers and sisters, or that describe someone as a great-aunt or great-uncle, all of which express the sibling relationship without using that exact word.

The technical foundation is embeddings: learned vector representations in which semantically similar words sit close together in a high-dimensional space. Word2Vec and GloVe learn one fixed vector per word from large text corpora, while transformer models such as BERT and GPT produce contextual vectors that change with the surrounding sentence. In genealogy-specific embeddings, words like "son," "descendant," "heir," and "offspring" occupy nearby regions of the vector space because they appear in similar contexts in genealogical documents and convey similar relationships.
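
To make "close together in vector space" concrete, here is a minimal sketch that ranks words by cosine similarity. The three-dimensional vectors are invented by hand for illustration; real embeddings are learned from text and have hundreds of dimensions.

```python
import math

# Hand-made 3-dimensional vectors, invented purely for illustration.
# Real embeddings are learned from text and are much larger.
TOY_VECTORS = {
    "son":       [0.90, 0.10, 0.05],
    "heir":      [0.80, 0.25, 0.10],
    "offspring": [0.85, 0.15, 0.00],
    "plough":    [0.05, 0.10, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_words(word, vectors=TOY_VECTORS):
    """Rank every other word by similarity to `word`, most similar first."""
    target = vectors[word]
    others = [(w, cosine_similarity(target, v))
              for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

for neighbour, score in nearest_words("son"):
    print(f"{neighbour:10s} {score:.3f}")
```

With these made-up vectors, the kinship words rank above "plough" for the query word "son"; with a trained model the same ranking step applies, only the vectors change.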

A more sophisticated approach uses sentence-level embeddings (from models like Sentence-BERT or Universal Sentence Encoder), where entire phrases or document excerpts are converted to vectors. A genealogy database indexed this way enables queries like "Find records describing family disputes over inheritance" even when the records never use those exact words. The semantic search engine encodes your query as a vector, compares it to the document vectors, and returns the closest results by geometric proximity (typically cosine similarity), even if no document contains any of the query's exact words.
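
The encode, compare, and rank steps can be sketched as follows. The `encode` function here is a toy stand-in, not a real sentence encoder: it maps words to hand-picked "concept" dimensions using an invented lexicon, just so the example runs without a trained model. The sample documents are also made up.

```python
import math

# Toy stand-in for a sentence encoder: each known word votes for one
# "concept" dimension. The lexicon is invented for this example; a real
# system would use a trained model instead.
CONCEPTS = ["conflict", "inheritance", "travel"]
LEXICON = {
    "disputes": "conflict", "quarrel": "conflict", "contested": "conflict",
    "inheritance": "inheritance", "will": "inheritance",
    "estate": "inheritance", "bequest": "inheritance",
    "sailed": "travel", "voyage": "travel", "passage": "travel",
}

def encode(text):
    """Map text to a unit-length concept vector (all zeros if no word is known)."""
    counts = [0.0] * len(CONCEPTS)
    for word in text.lower().replace(",", " ").replace(".", " ").split():
        concept = LEXICON.get(word)
        if concept is not None:
            counts[CONCEPTS.index(concept)] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts

def search(query, documents, top_k=2):
    """Return the top_k (score, document) pairs by dot product of unit vectors."""
    q = encode(query)
    scored = [(sum(a * b for a, b in zip(q, encode(d))), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

DOCUMENTS = [
    "The brothers contested their father's will after the quarrel.",
    "She sailed from Cork and paid for her passage in advance.",
    "The estate passed to the widow by bequest.",
]

for score, doc in search("family disputes over inheritance", DOCUMENTS):
    print(f"{score:.2f}  {doc}")
```

Neither of the two returned documents shares a word with the query; they match only because their vectors point in a similar direction. A production system would swap `encode` for a trained sentence-embedding model and keep the same ranking step.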

Genealogical context adds a real complication. The phrase "eldest son" has a plain meaning (the oldest male child), which a model trained on general English captures, but such a model may miss its genealogical weight, for example its link to inheritance and succession. A genealogy-specific embedding model, trained on millions of family history documents, can learn that "eldest son" is contextually distinct from "youngest daughter" and clusters with succession-related records. This domain specialization tends to improve search precision.

Relationship graphs add another layer. Rather than searching documents in isolation, genealogical semantic search can traverse relationship networks. A query for "John Smith's descendants" might return John's children, grandchildren, and great-grandchildren, along with their spouses as relatives by marriage; these relationships are inferred from structural data, not keyword matching. Knowledge graph embeddings (which extend the embedding idea to entities and relationships) support this kind of reasoning. Models like TransE and RotatE learn vectors for the parts of a relationship triple (John-IS_FATHER_OF-Samuel) so that true triples score as more plausible than false ones; in TransE, the head vector plus the relation vector should land near the tail vector.
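
As an illustration of how a TransE-style model scores a triple, the sketch below measures the distance between head + relation and tail. The two-dimensional vectors are chosen by hand so that the true triple fits; they are not trained values, and "Mary" is a made-up extra entity added for contrast.

```python
import math

# Toy 2-dimensional embeddings, chosen by hand so that the true triple
# fits well. Trained TransE embeddings are learned from many triples.
ENTITY = {
    "John":   [0.0, 0.0],
    "Samuel": [1.0, 0.5],
    "Mary":   [-0.8, 0.9],
}
RELATION = {
    "IS_FATHER_OF": [1.0, 0.5],
}

def transe_distance(head, relation, tail):
    """Euclidean distance ||h + r - t||; smaller means a more plausible triple."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def most_plausible_tail(head, relation):
    """Pick the entity (other than head) that best completes (head, relation, ?)."""
    candidates = [e for e in ENTITY if e != head]
    return min(candidates, key=lambda tail: transe_distance(head, relation, tail))

print(most_plausible_tail("John", "IS_FATHER_OF"))
```

Answering "who is John the father of?" then becomes a nearest-neighbour lookup around the point John + IS_FATHER_OF. RotatE follows the same idea but models each relation as a rotation rather than a translation.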

Historical language drift creates its own challenges. Census records, wills, and letters use terminology and phrasing that evolved across centuries. A 1750 document referring to "the apprentice Thomas" might be genealogically equivalent to "my ward Thomas" in an 1800 letter, but surface-level keyword matching fails to connect them. Semantic models trained on historical documents can learn these equivalences, but genealogists should be aware that a general embedding model (trained on modern text) is likely to perform worse on 18th-century documents than a historically adapted model.

Fuzzy relationship querying is a distinctive genealogical use case. Instead of searching for exact relationships ("father of"), users often want probabilistic queries: "likely ancestors," "probable connections," or "people who may have known each other." This requires modeling uncertainty in the search itself. Instead of returning only a ranked list of documents, the system returns candidate relationships with confidence scores, such as "87% likely to be a sibling based on location and date clustering." This is semantic search combined with probabilistic inference.
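
One simple way to turn several pieces of evidence into a confidence score is a logistic function over weighted features. The sketch below is only an assumption about how such a score could be built: the features, weights, bias, and candidate names are all invented, and the output is a true probability only if the weights have been fitted and calibrated against verified relationships.

```python
import math

# Hypothetical weights, invented for illustration; a real system would fit
# them to verified relationships rather than set them by hand.
WEIGHTS = {"same_surname": 1.5, "same_parish": 1.2, "birth_gap_years": -0.15}
BIAS = -1.0

def sibling_confidence(same_surname, same_parish, birth_gap_years):
    """Logistic score in (0, 1) combining three pieces of evidence."""
    z = (BIAS
         + WEIGHTS["same_surname"] * (1.0 if same_surname else 0.0)
         + WEIGHTS["same_parish"] * (1.0 if same_parish else 0.0)
         + WEIGHTS["birth_gap_years"] * abs(birth_gap_years))
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(candidates):
    """Sort (name, evidence) pairs by confidence, highest first."""
    scored = [(name, sibling_confidence(**evidence))
              for name, evidence in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

CANDIDATES = [
    ("Thomas Smith", {"same_surname": True, "same_parish": True, "birth_gap_years": 3}),
    ("Anne Smith", {"same_surname": True, "same_parish": False, "birth_gap_years": 25}),
    ("Peter Doyle", {"same_surname": False, "same_parish": True, "birth_gap_years": 2}),
]

for name, confidence in rank_candidates(CANDIDATES):
    print(f"{name}: {confidence:.0%} likely sibling")
```

The point of the design is that each candidate comes back with a score and the evidence behind it, so a researcher can decide which leads deserve manual verification.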

Recall and precision trade-offs are particularly relevant for genealogy. A search that returns many false positives (unrelated records that happen to be semantically similar) has low precision and wastes research time. A search that returns too few results (high precision, low recall) might miss genuine distant connections. Genealogists typically prefer higher recall with some false positives, because a missed connection is a lost research opportunity, whereas a false positive can usually be ruled out through manual verification. In practice this means setting the similarity threshold lower, so that more borderline matches are returned.
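
The effect of the similarity threshold on both measures can be shown with a small calculation. The scores and relevance labels below are made up for the example; precision is the share of returned records that are relevant, and recall is the share of relevant records that were returned.

```python
# Similarity scores and relevance labels are made up for this example.
RESULTS = [
    (0.92, True), (0.85, True), (0.78, False), (0.71, True),
    (0.64, False), (0.58, True), (0.40, False),
]

def precision_recall(results, threshold):
    """Precision and recall when only results scoring >= threshold are returned."""
    returned = [relevant for score, relevant in results if score >= threshold]
    true_positives = sum(returned)
    total_relevant = sum(relevant for _, relevant in results)
    # Convention assumed here: an empty result set counts as precision 1.0.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall

for threshold in (0.8, 0.5):
    p, r = precision_recall(RESULTS, threshold)
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

In this toy data, lowering the threshold from 0.8 to 0.5 finds every relevant record at the cost of two false positives, which is the trade most genealogists would accept.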

Metadata integration makes semantic search more powerful. Rather than embedding document text alone, a system can also use document metadata (date, location, occupation, source type), either embedded alongside the text or applied as structured filters and boosts. A search for "printers in Dublin 1790-1810" is then not just keyword-matched: the system treats printer-related occupations, Dublin locations, and the date range as jointly meaningful and boosts records matching all three dimensions. This requires representations that combine free text with structured data.
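
One way to combine the two signals is to add a fixed boost to the text-similarity score for each metadata field that matches. This is a sketch of that one option, not a description of how any particular product works; the `Record` fields, the boost size, the sample records, and the hand-listed set of printing trades are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    text_similarity: float  # assumed to come from an embedding comparison
    occupation: str
    place: str
    year: int

def combined_score(record, occupations, place, year_range, boost=0.2):
    """Text similarity plus a fixed boost for each structured field that matches."""
    score = record.text_similarity
    if record.occupation in occupations:
        score += boost
    if record.place == place:
        score += boost
    if year_range[0] <= record.year <= year_range[1]:
        score += boost
    return score

RECORDS = [
    Record("Apprenticeship indenture", 0.70, "compositor", "Dublin", 1795),
    Record("Guild membership roll", 0.75, "printer", "Cork", 1801),
    Record("Parish burial entry", 0.80, "weaver", "Dublin", 1830),
]

# Related occupations would come from the semantic layer; listed by hand here.
PRINTING_TRADES = {"printer", "compositor", "bookseller"}

ranked = sorted(
    RECORDS,
    key=lambda r: combined_score(r, PRINTING_TRADES, "Dublin", (1790, 1810)),
    reverse=True,
)
for record in ranked:
    print(record.title)
```

The record that matches occupation, place, and date range rises to the top even though its raw text similarity is the lowest of the three, which is the "jointly meaningful" behaviour described above.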

Try this: Search a major genealogy database (Ancestry, FamilySearch) using both a traditional keyword search and any "semantic" or "smart" search feature they offer. Compare results for a query like "Find records related to my ancestor's business." Note which results are irrelevant (for example, keyword matching that caught "business" without genealogical context) and which are unexpectedly useful (for example, semantic search that treated business ownership as an indicator of social standing). This shows how semantic understanding can augment genealogy research beyond keyword matching.
