Smart home

‘Hey Alexa’ no more? A new feature deemphasizes Alexa’s ‘wake word’

How Amazon built Conversation Mode, Alexa’s new continuous dialogue feature.

Francis Scialabba


“You talkin’ to me?”—Robert De Niro in Taxi Driver…and Alexa, pre-Conversation Mode, a new feature that uses AI, along with a slew of audio and visual cues, to determine whether a user is directly addressing the voice assistant.

When Conversation Mode is enabled, one or more users can signal Alexa and start a continuous dialogue, rather than repeating the wake word each time they address the assistant. Two months after the feature’s rollout, it’s available in the US on the Echo Show 10 device, which has a screen and camera. Amazon wouldn’t disclose how many users have enabled Conversation Mode, or how many Echo Show 10 devices it has sold.

Under the hood

In 2020, Arindam Mandal, director of conversational AI at Amazon, and his team of engineers and speech scientists began work on the feature, which involved equipping Alexa with AI capabilities it had never offered before: combining visual cues, computer vision, natural language processing, and contextual understanding to make the kind of instant judgment calls that humans make automatically.

At any given point in the process, he estimates, 30 to 40 people were working together to get Conversation Mode off the ground.

To begin with, the team had to decide exactly how Alexa would use the Echo Show 10’s camera and audio inputs to determine whether a user was speaking to it. The typical way to do this is via “supervised learning” AI: using tons of examples, annotation, and a…supervised approach to teach a model what’s what. But since that strategy can be time-consuming, they decided to simulate the necessary data.
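To see why labeling is the bottleneck, here's a bare-bones sketch of what the supervised baseline looks like (the feature, data, and training setup are invented for illustration, not Amazon's actual pipeline): given hand-labeled examples of "addressing the device" vs. not, fit a classifier that predicts the label from a visual feature. The expensive part the article describes is producing the labels, not this training step.

```python
import math

def train_logistic(examples, lr=0.5, epochs=200):
    """Fit a one-feature logistic model by plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            w += lr * (y - p) * x                      # nudge toward the label
            b += lr * (y - p)
    return w, b

# Hypothetical hand-labeled pairs: (gaze-at-camera score, addressing? 1/0).
# In practice, every one of these labels costs human annotation time.
labeled = [(0.9, 1), (0.8, 1), (0.7, 1), (0.2, 0), (0.1, 0), (0.3, 0)]
w, b = train_logistic(labeled)

def predict(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

The model itself is trivial; the months of work live in collecting and annotating the `labeled` list, which is exactly the dependency the simulation approach removes.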


Amazon's Echo Show 10 device (Amazon)

With simulation data, it took about 18 months from conception to rollout. Had they gone the other route, it would’ve cost the Alexa team an additional two years to pull off, Mandal said.

Mandal said this path resulted in a lower false-rejection rate than traditional supervised methods, but declined to provide the rates for either approach, or for Alexa pre- and post-Conversation Mode. Amazon also declined to provide Conversation Mode's false-positive rate. In this context, a false rejection is when Alexa ignores you even though you're addressing the device, and a false positive is when Alexa listens even though you didn't address it.

The team sourced examples (like people looking directly at, or away from, a camera) from open-source datasets, then vetted them alongside Amazon’s user experience team for gender and racial bias. After that, they multiplied the examples, ultimately generating a synthetic data set of 3D heads.
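The appeal of the synthetic route is that labels come free with the data. A loose sketch of the idea (the pose parameters and thresholds here are assumptions for illustration): instead of hand-labeling camera footage, sample head poses for rendered 3D heads and derive the "looking at the device" label directly from the pose used to generate each example.

```python
import random

def synth_example(rng: random.Random):
    """Generate one synthetic head pose with a label known by construction."""
    yaw = rng.uniform(-90, 90)    # head turn, degrees (0 = facing camera)
    pitch = rng.uniform(-45, 45)  # head tilt, degrees
    # No annotator needed: a near-frontal pose means "addressing the camera".
    label = int(abs(yaw) < 20 and abs(pitch) < 15)
    return {"yaw": yaw, "pitch": pitch}, label

rng = random.Random(0)
dataset = [synth_example(rng) for _ in range(1000)]
```

Because each example's ground truth is baked into how it was generated, scaling from hundreds to millions of examples is a compute problem rather than an annotation problem.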

“As soon as you can remove these restrictions of needing to collect data, it’s very enabling, and the technology moves very fast,” Mandal told us. Later, he added, “Simulation is actually a key invention for us in order for us to move away from dependency on large, supervised data sets.”

From there, the team also had to rein in computational load—if each judgment call required too much computation, Alexa would respond too slowly for a natural conversation. So they developed a shortcut in which Alexa was given a hierarchy of signals—think: someone looking directly at the camera, someone looking completely away, and everything in between—and weighted each one according to how likely it was that Alexa was being addressed. They fed those weights into a single neural network, which lets Alexa avoid running complex calculations each time it makes a judgment call.
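The cue-weighting idea above can be sketched minimally as follows. The cue names, weights, and the single-layer fusion are illustrative assumptions, not Amazon's actual model: each lightweight detector produces a score in [0, 1], and one linear pass plus a sigmoid fuses them into a "device-directed?" probability, so no heavy per-frame computation happens at decision time.

```python
import math

# Hypothetical visual cues and their learned weights in the hierarchy.
CUE_WEIGHTS = {
    "gaze_at_camera": 2.5,      # looking straight at the device: strong signal
    "face_toward_camera": 1.2,  # head pose roughly facing the device
    "lips_moving": 0.8,         # visible speech activity
    "looking_away": -2.0,       # fully turned away: strong negative signal
}
BIAS = -1.5  # default toward "not addressing the device"

def addressed_probability(cues):
    """Fuse per-cue scores into one probability with a single linear pass."""
    z = BIAS + sum(CUE_WEIGHTS[name] * score for name, score in cues.items())
    return 1.0 / (1.0 + math.exp(-z))

# A user gazing at the camera while speaking scores high...
engaged = addressed_probability({"gaze_at_camera": 0.9,
                                 "face_toward_camera": 0.8,
                                 "lips_moving": 1.0,
                                 "looking_away": 0.0})
# ...while a user facing away while talking scores low.
distracted = addressed_probability({"gaze_at_camera": 0.0,
                                    "face_toward_camera": 0.1,
                                    "lips_moving": 1.0,
                                    "looking_away": 0.9})
```

Collapsing the hierarchy into one pre-trained fusion step is what keeps the per-decision cost low enough for conversational latency.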

Then came the spoken language approach—which involved a similar way to lighten Alexa’s computational load—and the contextual understanding aspect.

Say you’re asking Alexa for a list of rom-com options for that evening, and the device lists 10 Things I Hate About You, When Harry Met Sally, and Always Be My Maybe. A diehard Meg Ryan fan might interrupt Alexa after the second option and say, “That one!” Before Conversation Mode, the device wouldn’t have understood that phrase in context, but now Mandal says it will, thanks to Alexa’s new anaphora resolution skill—deciding which named entity in a list the user is referring to, then passing that along to an API that can pull up the 1989 classic.
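The rom-com scenario can be sketched as a toy version of that anaphora-resolution step (the phrase patterns and dialogue state here are invented for illustration, not Alexa's implementation): map a reference like "the second one" or "that one" back to a named entity in the list the assistant just read out, then hand that entity to a downstream API.

```python
ORDINALS = {"first": 0, "second": 1, "third": 2}

def resolve_reference(utterance, last_list, highlighted=None):
    """Resolve a 'that one'-style reference against the most recent list.

    `highlighted` is the index of the item the assistant was reading
    when the user interrupted; returns None if nothing matches.
    """
    text = utterance.lower()
    for word, idx in ORDINALS.items():          # "the second one"
        if word in text and idx < len(last_list):
            return last_list[idx]
    if "last" in text and last_list:
        return last_list[-1]
    if "that one" in text and highlighted is not None:
        return last_list[highlighted]           # the item just spoken
    return None

movies = ["10 Things I Hate About You", "When Harry Met Sally",
          "Always Be My Maybe"]
# Interrupting right after the second option was read aloud:
pick = resolve_reference("That one!", movies, highlighted=1)
# → "When Harry Met Sally"
```

The real system works over arbitrary entity lists and noisier speech, but the core move is the same: carry the list as dialogue state so a bare pronoun can be grounded to a concrete entity.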

A snag…and a solution

In late 2020, Conversation Mode entered the world with a pizza party. Two people sat with an Echo device and walked through a pizza-buying experience: discussing where to order from, listening to a list of toppings to choose from, the works.

Mandal’s team, along with a handful of Amazon’s senior leaders, watched the first live demo from home.

“We saw it working in real life for the first time—and it was incredible to see,” Mandal said. “None of our devices, or anyone in the industry, had anything like that, that would work with [that] latency…It received a lot of encouraging signals.”

But by summer 2021, the tech had fallen short of expectations. Internal trials with more users suggested that Alexa wasn’t understanding spoken and visual cues as well as planned—a “frustrating moment,” Mandal recalled. That lack of responsiveness meant users had to repeat themselves in order to keep the conversation going.

Over the next three months, the team worked on a fix: source localization. Like a heat map of sorts, it helped Alexa gauge where the speech was coming from in relation to the device. For example, speech coming from directly in front of the device is likely more relevant than speech coming from far behind it.
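The direction-weighting idea behind source localization can be sketched like this (the angles and falloff curve are assumptions for illustration): discount an addressing-confidence score by where the speech came from, so chatter from behind the device has to clear a much higher bar than speech from in front of it.

```python
import math

def direction_weight(angle_deg):
    """Weight a speech source by its bearing; 0 degrees = directly in front."""
    # Cosine falloff, floored at 0.1 so rear speech is heavily
    # discounted rather than discarded outright.
    return max(0.1, math.cos(math.radians(angle_deg)))

def adjusted_score(raw_confidence, angle_deg):
    """Combine the addressing confidence with the localization weight."""
    return raw_confidence * direction_weight(angle_deg)

front = adjusted_score(0.7, 10)    # speaker facing the device
behind = adjusted_score(0.7, 170)  # chatter from behind the device
```

Folding location into the decision is what lets the same confidence threshold reject background conversation without also rejecting the user standing in front of the screen.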

“Once that capability was brought in, we were able to greatly tighten up the false rejection, false acceptance tradeoffs,” Mandal said. “That helped us move all of our key performance indicators into the launch territory from the cautious red zone it was in before.”

In September, the source localization work wrapped up—just in time for its November debut.
