Periagoge

What Training Data Means (And Why It Matters for Family History)

The data an AI system was trained on shapes what it can recognize. Knowing which records, languages, time periods, and writing styles your genealogy AI learned from tells you a great deal about where it is likely to be reliable and where it is likely to fail. An AI trained primarily on 19th-century English printed text will struggle with 17th-century Dutch handwriting, not because it's unintelligent, but because it never learned to recognize that combination of language, period, and script.

Hypatia
Why It Matters

Think of AI training data like the books a historian studied. If a historian only read books about wealthy landowners, they'd have a skewed view of history and might give you bad advice about working-class ancestors. AI works the same way: it learns from the information it was trained on, and gaps in that information become blind spots.

AI systems are trained on text from the internet, databases, and documents. For genealogy, that typically means the AI has learned from family history websites, census records, historical documents, and genealogy forums. But here's the catch: this training data isn't evenly balanced. There's more available information about some groups (wealthy families, people in major cities, people of European descent) than others (enslaved people, immigrant workers, Indigenous populations, people in rural areas). The AI reflects these imbalances.

What This Means for Your Research

When you ask AI to find information or make connections about your ancestors, it's drawing on this lopsided training data. For some ancestors, AI will give you helpful, accurate guidance. For others, particularly ancestors from minority groups, immigrant backgrounds, or very poor populations, AI might miss important context or struggle because fewer of their records were preserved or digitized in the first place.

This isn't the AI being lazy. It was trained on less information about those groups because, historically, fewer of their records were preserved or digitized. A separate risk is that AI might confidently state something false because it learned from genealogy websites that copied the same error from one another.
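To make the repeated-error problem concrete, here is a deliberately simplified sketch. It assumes a toy "learner" that just picks whichever claim appears most often. Real AI systems are far more complex than a tally, and every source name and year below is invented for illustration, but the sketch shows how repetition can outweigh accuracy.

```python
from collections import Counter

# Hypothetical toy data: birth years claimed for the same ancestor.
# All sources and values are invented for illustration only.
claims = [
    ("family-tree site A", 1842),
    ("family-tree site B", 1842),  # copied from site A
    ("family-tree site C", 1842),  # copied from site B
    ("forum post", 1842),          # repeats the same figure
    ("parish register transcription", 1824),  # the only primary source
]


def most_repeated_claim(claims):
    """Return the most frequent claimed value and its share of all claims."""
    counts = Counter(year for _source, year in claims)
    year, count = counts.most_common(1)[0]
    return year, count / len(claims)


year, share = most_repeated_claim(claims)
print(f"Most repeated birth year: {year} ({share:.0%} of sources)")
```

In this made-up example the copied figure wins by a wide margin even though only the parish register is an original record. Counting how often a claim appears says nothing about whether it is right.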

How to Protect Yourself

Always verify what AI tells you by checking original sources. If AI suggests a connection between two ancestors, ask which record supports it, then find and read that record yourself. If it can't point to one, treat the connection as unverified. For ancestors from under-documented populations, use AI as a starting point only and lean heavily on original records and community history resources.

Try this: Ask AI about your most well-documented ancestor and your least-documented one. Notice the difference in confidence and detail, then check how much of each answer holds up against the records. The variation gives you a rough picture of where AI's training data is strongest and weakest.
