Transcription Accuracy Challenges: When AI Speech-to-Text Fails for Disabled Users

AI speech-to-text systems like Otter.ai have transformed accessibility for people with mobility limitations or those who prefer speaking over typing. However, transcription accuracy varies dramatically depending on speaker characteristics, audio quality, and language variation. For disabled speakers, particularly those with speech disabilities or atypical prosody, and for speakers with non-native accents, accuracy can be substantially lower than advertised benchmarks.

How Transcription Models Learn (and Fail)

Modern speech recognition uses neural networks trained on massive audio datasets. The training data is typically biased toward certain demographic groups: younger speakers, native English speakers with minimal accents, clear articulation without stuttering or dysarthria, and good-quality audio without background noise. When the model encounters speech patterns absent from training data, accuracy degrades.

This isn't a flaw in the AI's design; it's a statistical consequence of training data composition. If, say, 95% of the training audio comes from native speakers, the model optimizes for that distribution. When it processes speech from someone with a heavy accent, it assigns lower probability to the words actually spoken because it was trained to expect native-speaker patterns. The model isn't 'understanding' speech; it is matching patterns against probabilities learned from its training data.
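
To make the statistical point concrete, here is a toy sketch. It is not how any production recognizer works. It assumes an invented training set in which 95 of 100 examples follow one speech pattern and 5 follow another, and it estimates probabilities by simple counting. The labels "typical" and "atypical" and the counts are made up for illustration.

```python
from collections import Counter

# Hypothetical training set: 95% of examples follow one speech pattern.
training_examples = ["typical"] * 95 + ["atypical"] * 5


def estimate_probabilities(examples):
    """Estimate P(pattern) as its relative frequency in the training data."""
    counts = Counter(examples)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}


probabilities = estimate_probabilities(training_examples)
for pattern, p in sorted(probabilities.items()):
    print(f"{pattern}: {p:.2f}")
```

A model built this way expects the rare pattern only 5% of the time, so it starts from a strong bias against it before hearing any audio. Real systems are far more complex, but the dependence on training frequencies is the same in spirit.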

Specific Challenges for Disabled Speakers

Stuttering and speech dysfluency: Repeated sounds, prolongations, and blocks confuse models trained on fluent speech. "I w-w-want to know" becomes a transcription puzzle because the repetition is statistically anomalous in the training data.
Dysarthria (slurred or unclear speech): Conditions like cerebral palsy, stroke, or Parkinson's disease affect articulation clarity. Models optimized for clear articulation struggle with hypernasality, reduced volume, or imprecise consonants.
Non-standard prosody: Atypical intonation, rhythm, or stress patterns (common among autistic speakers, people with speech disorders, and second-language (L2) speakers) create statistical surprises for the model. The probability calculations diverge from expected patterns.
Variability in speaker characteristics: Some disabled speakers have good days and bad days with speech clarity. A model trained on consistent speakers generalizes poorly to this variability.

Quantifying the Problem

Reported results suggest that typical commercial speech-to-text systems achieve 95%+ accuracy for native speakers in quiet conditions, but accuracy drops to 60-75% for non-native speakers and potentially below 50% for speakers with dysarthria or severe stuttering. Exact figures vary by system and test conditions. The gap is not random. It is systematic, meaning the same words tend to be mistranscribed again and again. For someone relying on transcription for communication or documentation, error rates at this level make the output unreliable.

Human transcribers also struggle with these speakers, though for different reasons than AI, such as fatigue and unfamiliarity. One advantage of AI is that, given sufficient speaker-specific training data, accuracy can improve dramatically. This requires adaptation: feeding the system examples of your speech patterns so it can adjust its probability calculations to your specific characteristics.

Strategies for Improving Transcription Accuracy

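Speaker adaptation: Some systems allow users to upload voice samples or training audio, which the system uses to fine-tune its model to your voice characteristics. Otter.ai's speaker identification feature can help separate multiple speakers in audio; this is a related but distinct function, since it labels who is speaking rather than adapting recognition to how you speak. Where adaptation is supported, accuracy typically improves after the system has processed several of your recordings.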

Custom vocabulary: If you frequently use specialized terms, names, or non-standard words, you can add them to a custom dictionary. This biases the model's probability calculations toward words you're likely to say, improving accuracy for your specific domain.
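
The biasing idea can be sketched as a simple rescoring step. This is an illustration under stated assumptions, not the mechanism of Otter.ai or any other product: the `rescore` function, the `boost` value, and the candidate scores below are all hypothetical.

```python
def rescore(hypotheses, custom_vocabulary, boost=0.2):
    """Return the best hypothesis after boosting those containing custom terms.

    hypotheses: list of (text, score) pairs; a higher score means more likely.
    custom_vocabulary: set of lowercase terms the user expects to say.
    boost: hypothetical bonus added once per matching word.
    """
    best_text, best_score = None, float("-inf")
    for text, score in hypotheses:
        matches = sum(1 for word in text.lower().split() if word in custom_vocabulary)
        adjusted = score + boost * matches
        if adjusted > best_score:
            best_text, best_score = text, adjusted
    return best_text


# Invented scores: the recognizer slightly prefers the more common phrase.
candidates = [
    ("the patient has this area", 0.55),
    ("the patient has dysarthria", 0.45),
]

print(rescore(candidates, custom_vocabulary=set()))
print(rescore(candidates, custom_vocabulary={"dysarthria"}))
```

Without a custom vocabulary the more common phrase wins. Once "dysarthria" is added, the boost is enough to tip the choice toward the specialized term.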

Post-transcription correction: Rather than expecting perfect transcription, treat the AI output as a draft that a human edits. For people using AAC (augmentative and alternative communication) devices, real-time transcription with a human correction loop can be effective. The user speaks, the system transcribes, a human assistant corrects obvious errors, and the corrected output feeds back to improve future predictions.

Audio quality optimization: High background noise disproportionately harms accuracy for speech with lower volume or clarity. Using microphones positioned close to the speaker, noise-cancelling settings, and quiet environments significantly improves transcription quality.

The Equity Problem

Standard benchmarking of transcription accuracy using demographically uniform speakers masks the real-world equity problem. A system reporting 95% accuracy might achieve 95% for native speakers but only 70% for non-native speakers. Marketing emphasizes the former; users whose speech differs from the benchmark population, including many disabled users, experience the latter. Responsible deployment requires testing accuracy across diverse speaker populations and being transparent about group-level performance differences.
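
A short calculation shows how a headline number can hide a group-level gap. The 95% and 70% accuracies come from the example above; the 90/10 split of the test set is an assumption made for illustration, and `overall_accuracy` is a hypothetical helper.

```python
def overall_accuracy(groups):
    """Weighted average of per-group accuracy, weighted by share of the test set."""
    total_share = sum(share for share, _ in groups.values())
    return sum(share * acc for share, acc in groups.values()) / total_share


# (share of test set, accuracy); the 90/10 split is an assumed composition.
benchmark = {
    "native": (0.90, 0.95),
    "non_native": (0.10, 0.70),
}

headline = overall_accuracy(benchmark)
print(f"Headline accuracy: {headline:.1%}")
for group, (share, acc) in benchmark.items():
    print(f"{group}: {acc:.0%} accuracy, {share:.0%} of test set")
```

Under these assumptions the headline figure is about 92.5%, which looks close to the native-speaker result even though one group is 25 percentage points worse off. The smaller the underrepresented group's share of the benchmark, the less its errors move the headline.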

Try this: If you use any speech-to-text tool (Otter, built-in dictation, etc.), record a 5-minute sample of your speech and transcribe it. Count the word-level errors. Then deliberately record a sample with atypical speaking patterns—faster pace, softer volume, or more informal language—and transcribe again. Compare accuracy rates. This reveals how variation in your own speech affects the system. If accuracy is below 85%, explore speaker adaptation options or consider human transcription for critical content.
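
Counting word-level errors by hand is tedious, so the comparison can be sketched in a few lines. This is a minimal version using word-level edit distance; it assumes you have typed out a correct reference transcript yourself, and it does not strip punctuation, so clean both texts first. The function names and sample sentences are illustrative.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Accuracy as 1 minus errors per reference word, floored at zero."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return max(0.0, 1 - word_errors(reference, hypothesis) / ref_len)


reference = "i want to know when the meeting starts"
hypothesis = "i want to no when meeting starts"

accuracy = word_accuracy(reference, hypothesis)
print(f"Word accuracy: {accuracy:.0%}")
if accuracy < 0.85:
    print("Below 85%: consider speaker adaptation or human transcription")
```

In this sample, one substitution ("know" to "no") and one dropped word ("the") out of eight reference words give 75% accuracy. Run it once on your typical sample and once on the atypical one to compare the two rates.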