Generic ASR will never be accurate enough for Conversational AI

The human brain is remarkably good at processing speech and understanding what is said. If we are talking about a baseball game, your brain understands that when I say "pitcher" and "batter", I don't mean a large vessel for pouring drinks and a mix for making pancakes. Your brain matches the words to the context and the intent of the conversation. Your brain also has an amazing noise filter that lets it focus on the important parts of a conversation. At a baseball game there is constant noise around you, but when your buddy talks to you, you can focus on his voice, hear him and understand him clearly.

Intent Matters

How does a Conversational AI system determine the intent of the conversation and focus on the important words? Let's talk about a possible future Conversational AI example. Imagine a robot waiter at a local pub. There are several conversations going on around it. The booth to its left is talking about a weird internet video. A table behind it is complaining about the last place the group ate and how bad the chicken was. And finally, the table in front of it has delegated the task of ordering appetizers to the person at the back of the table, with everyone else calling out their requests to that person. Even given a one hundred percent accurate transcript of all the audible conversation, it would be really hard for the robot to work out what should be happening here. Did they just order chicken tenders, or was that the other table? Was that two orders of the appetizer, or was the first person asking the other person to order it? Was that 'mh-uh' a no, meaning they don't want the biggie sized version, or was it just someone clearing their throat?

Accuracy with Intent

In the example above, a generic Automatic Speech Recognition (ASR) system could transcribe all the audio in the immediate area of the robot and produce a jumbled transcript. The words may be mostly accurate, say with a 10% Word Error Rate (WER), but does that help a Conversational AI system understand what was said or respond appropriately? To gain understanding, you need accuracy that is focused on the important keywords and the intent of the conversation.

ASR for Conversational AI and voicebots cannot be generic; it will never be accurate enough. The ASR must have a speech model that is tailored to the intent of the conversation and can focus on the keywords important to understanding. This tailored approach helps to remove background noise and speech that is not part of the intent of the conversation.

So, what type of ASR can be tailored to your Conversational AI? An End to End Deep Learning ASR. This type of ASR can be trained with your audio data to make sure the intent is captured and the transcription is accurate for your use case. It can also be continually trained and improved to gain more accuracy and focus.

Hear more about how "Generic ASR will never be accurate enough for Conversational AI" in our speaking session at the 2021 Conversational AI and NLP summit by RE WORK. Sign up here.
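
To make the point about WER versus keyword accuracy concrete, here is a minimal Python sketch. It is an illustration only: the function names, the example sentences and the keyword set are all made up for this post, and the keyword score is a simple stand-in rather than a standard metric. It shows two transcripts with the same 10% WER, only one of which keeps the words the robot waiter actually needs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def keyword_recall(reference: str, hypothesis: str, keywords: set) -> float:
    """Fraction of the reference's keywords that survive in the hypothesis.

    Hypothetical helper for illustration; not a standard ASR metric.
    """
    hyp_words = set(hypothesis.lower().split())
    expected = [w for w in reference.lower().split() if w in keywords]
    if not expected:
        return 1.0
    return sum(1 for w in expected if w in hyp_words) / len(expected)


# Made-up example data (assumptions, not real ASR output).
reference = "can we get two orders of chicken tenders please thanks"
keywords = {"two", "chicken", "tenders"}

# One error on a filler word: the order is still intact.
hyp_filler_error = "can we get two orders of chicken tenders please"
# One error on a keyword: the order is now wrong.
hyp_keyword_error = "can we get two orders of chicken vendors please thanks"

for name, hyp in [("filler error", hyp_filler_error),
                  ("keyword error", hyp_keyword_error)]:
    wer = word_error_rate(reference, hyp)
    recall = keyword_recall(reference, hyp, keywords)
    print(f"{name}: WER={wer:.2f}, keyword recall={recall:.2f}")
```

Both transcripts score the same WER, yet only the first would let the robot place the right order, which is why accuracy on the words that carry the intent matters more than the headline number.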

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.