Here at Tech Brew, our editorial process heavily features the AI transcription software Otter, through which we run recordings of our interviews. It’s not perfect, by any means—our account has an inexplicable preoccupation with mishearing words as “Covid,” for one (more on that later)—but it definitely beats the arduous process of transcribing by hand.
As tech companies seek to make generative AI commonplace in everyday home and work life, speech recognition is a sizable part of that mission, whether it’s summarizing meetings, conversing with voice assistants, translating in real time, or even taking orders at drive-thrus. And ever since deep learning breakthroughs in the 2010s made tools like Otter widespread and then transformers improved them further, this capability has been taken somewhat for granted—it’s maybe not the flashiest subfield of AI in 2025.
But George Saon, a distinguished research scientist who leads IBM’s speech strategy, told Tech Brew it’s a mistake to treat speech recognition as a solved problem. Issues like discerning overlapping voices, making out different accents, and dealing with background noise can still make real-world use cases dicey, he said.
“It’s interesting, because every few years, someone claims that speech recognition has been solved, that we’re at the point where we’re just as good as humans, so we have to move on,” Saon told us. “And at IBM, we don’t believe that. We think there’s still a ways to go to get there. And the reason is that speech is complex, because it has many dimensions of variability.”
At the time of this conversation, IBM’s latest Granite speech recognition model was the leading system on Hugging Face’s leaderboard (it’s now in second place). Saon said his team has zeroed in on a number of improvements that have boosted their models’ accuracy, from adding more robust training data across different dialects to using LLMs to suss out the context of a given conversation.
Vocal fry with that: In 2021, IBM and McDonald’s announced a partnership to test automated drive-thru ordering in certain restaurants. After a nearly three-year test, the fast food chain scrapped the technology last year.
Saon said the task of taking orders was fraught with challenges. There was the background noise: McDonald’s locations might be near noisy airports or train stations; fire trucks and ambulances could screech by. Some drive-thrus had two side-by-side ordering booths, and the system would confuse orders between them. It could also mistakenly pick up on side chatter inside the cars.
Different dialects and accents make understanding more difficult as well. “Our models now are multi-dialect,” Saon said. “They can work just as well for Australian English, Indian, UK—whatever accent you want—that works now, but that wasn’t the case a while ago.”
It’s generally harder to parse speech than written text, Saon said. “Then you have, um, people, when they, they talk—they—it’s not like written text, right? It’s—they’re often agrammatical, they, they, they have repetitions, deletions, hesitations. It’s, it’s, it’s messy.” (This quote was transcribed exactly; it’s what most of our quotes would look like if we didn’t lightly edit out verbal tics, as we do for nearly everyone.)
Getting noisy: IBM is exploring how LLMs can help voice recognition systems better understand the context behind a given audio clip, Saon said. For instance, IBM’s model could guess the correct spelling of the golfer Jack Nicklaus’s name if it ascertains that a conversation is about golf. (For the record, Otter’s transcription got the spelling right half the time.)
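To make the idea concrete, here is a small, hypothetical sketch of what contextual rescoring can look like in principle: the recognizer produces a few competing transcriptions, and a scoring step nudges the choice toward whichever one best fits the inferred topic. The keyword lists, weights, and function names below are invented for illustration; IBM is exploring LLMs for this, not simple keyword counts.

```python
# Toy sketch of contextual rescoring (illustrative only, not IBM's or Otter's pipeline).
# Idea: pick the ASR hypothesis whose words best match the conversation's topic.

TOPIC_VOCAB = {
    "golf": {"golf", "golfer", "masters", "fairway", "birdie", "nicklaus"},
    "news": {"election", "senate", "campaign", "nicholas"},
}

def infer_topic(context: str) -> str:
    """Guess the topic by counting keyword hits in the surrounding conversation."""
    words = set(context.lower().split())
    return max(TOPIC_VOCAB, key=lambda topic: len(TOPIC_VOCAB[topic] & words))

def rescore(nbest: list[tuple[str, float]], context: str) -> str:
    """Blend the recognizer's own confidence with a bonus for topic-matching words."""
    topic_words = TOPIC_VOCAB[infer_topic(context)]
    def score(hypothesis: tuple[str, float]) -> float:
        text, asr_confidence = hypothesis
        topic_bonus = sum(word in topic_words for word in text.lower().split())
        return asr_confidence + 0.5 * topic_bonus
    return max(nbest, key=score)[0]

context = "he won 18 majors, more than any other golfer"
nbest = [("jack nicholas", 0.52), ("jack nicklaus", 0.48)]
print(rescore(nbest, context))  # -> "jack nicklaus"
```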
One surprising revelation was that integrating multilingual LLMs allowed the model to interpret code-switching—in this sense, shifting between languages mid-sentence or mid-phrase—without any additional training, Saon said.
To prime the model for some of the problems it might encounter in the real world, Saon said, the team introduces noises, distortions, and gaps into the training data. He said there’s no algorithmic shortcut around simply adding more training data that covers the different ways people speak and audio is captured.
“We take training utterances, and we artificially add various types of noises from babble speech or cars, or all sorts of things, to make the model more robust to noisy conditions,” Saon said. “So data variety…and perturbation, augmentation of the acoustic signal.”
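As a rough illustration of that kind of perturbation (a minimal sketch, not IBM’s pipeline; the waveforms, signal-to-noise range, and speed factors are placeholders), the idea is to mix recorded background noise into a clean utterance at a random signal-to-noise ratio and jitter the playback speed:

```python
# Illustrative data augmentation for ASR training: noise mixing plus speed perturbation.
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into a clean waveform at the requested signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)                # loop or trim the noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Crude speed perturbation: resample so the utterance plays back faster or slower."""
    new_length = int(len(wave) / factor)
    return np.interp(np.linspace(0, len(wave) - 1, new_length),
                     np.arange(len(wave)), wave)

rng = np.random.default_rng(0)
clean_utterance = rng.standard_normal(16_000)            # stand-in for one second of clean speech
babble_noise = rng.standard_normal(16_000)               # stand-in for recorded street or cafe babble
augmented = speed_perturb(
    add_noise(clean_utterance, babble_noise, snr_db=rng.uniform(5, 20)),
    factor=rng.choice([0.9, 1.0, 1.1]),
)
```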
Improving this accuracy is becoming more important as generative AI becomes more multimodal—able to field requests by voice or image in addition to text. While text is still the dominant way that people interact with LLMs, companies want to add voices to AI for everything from smart speakers to customer service reps—and drive-thru orders. Contextual understanding could unlock new gains in making these systems more precise, Saon said.
“Context is everything,” he said. “Contextual [automatic speech recognition] is going to be a big thing in the future.”
As for Otter’s penchant for mistakenly hearing “Covid” with disproportionate frequency, we reached out to the company to try to understand the AI’s behavior. Otter spokesperson Mitchell Woodrow took the issue to Otter’s engineering team and came back with a likely explanation: Otter may be fixated on Covid simply because we as a society have also been for much of the past half-decade—and the model learns from us.
“In cases of unclear audio, [automatic speech recognition] models may default to highly recognizable terms like ‘Covid,’” Woodrow said. “The term ‘Covid’ became one of the most frequently spoken words globally in recent years, which means it’s statistically more likely to appear when the model is uncertain. Our system is designed to continuously adapt, which allowed us to implement a quick fix in this case.”
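In other words, when the audio evidence is nearly a wash, the word-frequency prior carries the decision. Here is a toy, purely illustrative calculation (the numbers are made up, and this is not Otter’s actual model) of how a decoder that weighs acoustic fit against word frequency can land on “Covid”:

```python
# Purely illustrative: with muddy audio the acoustic scores are nearly tied,
# so the word-frequency prior decides, and a globally common word wins.
acoustic = {"covid": 0.30, "corvid": 0.35, "cove it": 0.35}   # how well each candidate matches the audio
prior = {"covid": 0.90, "corvid": 0.02, "cove it": 0.08}      # how often each shows up in training text

unnormalized = {word: acoustic[word] * prior[word] for word in acoustic}
total = sum(unnormalized.values())
posterior = {word: score / total for word, score in unnormalized.items()}

print(max(posterior, key=posterior.get))  # "covid", despite slightly weaker acoustic evidence
```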