Speech-to-text AI struggles with non-standard speech patterns—accents, dysarthria, vocal strain, atypical rhythm—because most models train on a narrow band of typical speakers, leaving disabled users with high error rates that make dictation unusable. The gap isn't a technical accident; it's a training data problem, and it means AI assistants often work worst for the people who rely on them most.
AI speech-to-text systems like Otter.ai have transformed accessibility for people with mobility limitations and for those who prefer speaking over typing. However, transcription accuracy varies dramatically with speaker characteristics, audio quality, and language variation. For speakers with speech disabilities, non-native accents, or atypical prosody, accuracy can be substantially lower than advertised benchmarks.
Modern speech recognition uses neural networks trained on massive audio datasets. The training data is typically biased toward certain demographic groups: younger speakers, native English speakers with minimal accents, clear articulation without stuttering or dysarthria, and good-quality audio without background noise. When the model encounters speech patterns absent from training data, accuracy degrades.
This isn't a flaw in the AI's design—it's a statistical consequence of training data composition. If 95% of training audio is from native speakers, the model optimizes for this distribution. When processing someone with a heavy accent, the model assigns lower probability to their actual words because it was trained to expect native speaker patterns. The model isn't 'understanding' speech—it's pattern matching based on statistical probabilities learned from training data.
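The statistical effect described above can be illustrated with a deliberately tiny counting model. This is a toy sketch, not how a neural recognizer is built: the words, sounds, and counts are invented for illustration, and `posterior` is a hypothetical helper. It only shows how a pronunciation that is rare in training data ends up with a low probability for the word the speaker actually meant.

```python
from collections import Counter


def posterior(training_pairs, heard):
    """Estimate P(intended word | heard sound) by counting training pairs."""
    counts = Counter(word for word, sound in training_pairs if sound == heard)
    total = sum(counts.values())
    if total == 0:
        return {}  # the sound never appeared in training
    return {word: n / total for word, n in counts.items()}


# Invented training set: (intended word, sound the model heard).
training = (
    [("three", "three")] * 95   # majority pronunciation of "three"
    + [("three", "tree")] * 5   # minority pronunciation of the same word
    + [("tree", "tree")] * 100  # the word "tree" itself
)

probs = posterior(training, "tree")
best_guess = max(probs, key=probs.get)
print(probs, best_guess)
```

With these invented counts, a speaker who says "three" with the minority pronunciation is transcribed as "tree", because that is the outcome the training data makes most probable.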
Reported figures suggest that typical commercial speech-to-text systems reach 95% or higher accuracy for native speakers in quiet conditions, but that accuracy can fall to 60-75% for non-native speakers and potentially below 50% for speakers with dysarthria or severe stuttering. The gap is systematic rather than random: the same words tend to be mistranscribed again and again. For someone who relies on transcription for communication or documentation, error rates at that level are unacceptable.
Interestingly, human transcribers also struggle with similar populations, though through different mechanisms (fatigue, unfamiliarity) than AI. The advantage of AI is that with sufficient speaker-specific training data, accuracy can improve dramatically. This requires adaptation—feeding the system examples of your speech patterns so it can adjust its probability calculations for your specific characteristics.
Speaker adaptation: Some systems allow users to upload voice samples or training audio, which the system uses to fine-tune its model to your voice characteristics. After processing several of your recordings, accuracy typically improves. Otter.ai's speaker identification feature addresses a related but separate task: it can help separate multiple speakers in a recording.
Custom vocabulary: If you frequently use specialized terms, names, or non-standard words, you can add them to a custom dictionary. This biases the model's probability calculations toward words you're likely to say, improving accuracy for your specific domain.
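One way to picture custom-vocabulary biasing is as a rescoring step over the recognizer's candidate transcripts. The sketch below is an assumption about the general technique, not the API of Otter.ai or any other product: `rescore`, the `boost` value, and the candidate scores are all hypothetical.

```python
def rescore(hypotheses, custom_vocab, boost=2.0):
    """Pick the best transcript after rewarding custom-vocabulary words.

    hypotheses: list of (text, log_score) pairs, higher score is better.
    boost: hypothetical bonus added once per custom-vocabulary word found.
    """
    vocab = {word.lower() for word in custom_vocab}

    def boosted(item):
        text, score = item
        hits = sum(1 for word in text.lower().split() if word in vocab)
        return score + boost * hits

    return max(hypotheses, key=boosted)[0]


# Invented candidate list: the recognizer slightly prefers the wrong one.
n_best = [
    ("i take back low fen daily", -4.0),
    ("i take baclofen daily", -5.0),
]

print(rescore(n_best, custom_vocab=[]))
print(rescore(n_best, custom_vocab=["baclofen"]))
```

Without the custom entry the higher-scoring wrong transcript wins; with "baclofen" in the vocabulary, the bonus lifts the correct candidate above it.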
Post-transcription correction: Rather than perfect transcription, AI provides a foundation that humans edit. For people using AAC (augmentative and alternative communication) devices, real-time transcription with human correction loops has proven effective. The user speaks, the system transcribes, the assistant corrects obvious errors, and the corrected output feeds back to improve future predictions.
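The correction loop described above can be sketched as a small memory of past fixes. This is a minimal illustration under strong assumptions: `CorrectionMemory` is a hypothetical class, it only learns one-to-one word swaps, and a real system would need context to avoid over-correcting words that are sometimes right.

```python
class CorrectionMemory:
    """Remembers word-level corrections and reapplies them to new transcripts."""

    def __init__(self):
        self.fixes = {}

    def learn(self, transcribed, corrected):
        """Record each word the human changed, when the word counts match."""
        old_words, new_words = transcribed.split(), corrected.split()
        if len(old_words) != len(new_words):
            return  # insertions and deletions are ignored in this sketch
        for old, new in zip(old_words, new_words):
            if old != new:
                self.fixes[old] = new

    def apply(self, text):
        """Replace every word that has a remembered correction."""
        return " ".join(self.fixes.get(word, word) for word in text.split())


memory = CorrectionMemory()
memory.learn("meet doctor battle at noon", "meet doctor patel at noon")
print(memory.apply("ask doctor battle about refills"))
```

After one human correction, the same mistranscription is fixed automatically in later transcripts.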
Audio quality optimization: High background noise disproportionately harms accuracy for speech with lower volume or clarity. Using microphones positioned close to the speaker, noise-cancelling settings, and quiet environments significantly improves transcription quality.
Standard benchmarking of transcription accuracy using demographically uniform speakers masks the real-world equity problem. A system reporting 95% accuracy might have 95% for native speakers but 70% for non-native speakers. Marketing emphasizes the former; disabled users experience the latter. Responsible deployment requires testing accuracy across diverse speaker populations and being transparent about group-level performance differences.
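The masking effect is simple arithmetic, and can be checked with a short script. The group names and word counts below are illustrative only, chosen to mirror the 95% and 70% figures in the paragraph above; they are not measurements, and `accuracy_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def accuracy_by_group(results):
    """Return (overall accuracy, per-group accuracy) from word counts.

    results: iterable of (group, words_correct, words_total) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, words_correct, words_total in results:
        correct[group] += words_correct
        total[group] += words_total
    per_group = {g: correct[g] / total[g] for g in total if total[g] > 0}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group


# Illustrative test set dominated by one group of speakers.
sample = [
    ("native", 9500, 10000),
    ("non_native", 700, 1000),
]

overall, per_group = accuracy_by_group(sample)
print(f"overall: {overall:.1%}")
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.1%}")
```

Because the test set is dominated by one group, the headline figure stays close to that group's accuracy and hides the much lower figure for the other group. Reporting the per-group breakdown makes the difference visible.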
Try this: If you use any speech-to-text tool (Otter, built-in dictation, etc.), record a 5-minute sample of your speech and transcribe it. Count the word-level errors (wrong, missing, and extra words) and divide by the number of words you actually said; accuracy is roughly one minus that fraction. Then deliberately record a sample with atypical speaking patterns, such as a faster pace, softer volume, or more informal language, and transcribe again. Compare the two accuracy rates. This reveals how variation in your own speech affects the system. If accuracy is below 85%, explore speaker adaptation options or consider human transcription for critical content.
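Counting word-level errors by hand is tedious for a 5-minute sample, so here is a minimal sketch that does it with a word-level edit distance. It assumes you have typed out what you actually said as the reference; `word_errors` and `word_accuracy` are hypothetical helper names, and the comparison is case-insensitive but does not strip punctuation.

```python
def word_errors(reference, hypothesis):
    """Minimum number of substituted, missing, and extra words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """One minus the word error rate, floored at zero."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return max(0.0, 1.0 - word_errors(reference, hypothesis) / ref_len)


said = "please send the report to my doctor today"
transcribed = "please send the report to my daughter"
print(word_errors(said, transcribed), word_accuracy(said, transcribed))
```

In this example one word is substituted and one is missing, so two errors over eight spoken words gives 75% accuracy, below the 85% threshold suggested above.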