What Makes a Great Conversational AI Experience?
You are not alone if you have ever had a poor interaction with a voicebot. Perhaps you heard, "Sorry, I didn't get that. Can you repeat your request?" or were transferred to a human agent after asking a simple question. Unfortunately, these experiences are common across industries, resulting in a negative customer experience and ultimately causing churn. So what is currently possible at the leading edge of Conversational AI? We are getting much closer to voicebots that converse like humans, but only on specific subjects or use cases. There is still no voicebot you can converse with on any topic, and none that can serve as a true personal assistant.
The need to be persistent
So what are the obstacles to human-like conversational AI? Many technical and data obstacles stand between us and that goal. Kevin Fredrick, Managing Partner of OneReach.ai, expressed this best when he said, "Building a Conversational AI voicebot is like planning to summit a mountain. Those who are looking for an 'easy button' get frustrated and quit. The ones who think it will be too hard, don't ever start. It is the ones who know the challenge is worth it and have the right partners and use the right tools who make the summit." There are still technical challenges to overcome, including faster and more accurate transcription, better Natural Language Processing (NLP) and Natural Language Understanding (NLU), more human-like text-to-speech engines, and tighter integration between all the parts of this workflow, but we can see the path to the summit.
Lack of training data
On the data side, Antonio Valderrabanos, CEO of Bitext, pointed out that the availability of data for AI model training and evaluation is one of the main challenges in creating a voicebot for all languages, accents, dialects, and use cases. Do we have the audio and text data in those accents, dialects, and use cases to train such a model? That data currently does not exist in the public domain, or even in the private domain, so it must be built in a scalable way. Valderrabanos believes we can get there, but only with automated methods for data generation, for both training and evaluation.
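To make "automated data generation" a little more concrete, here is a minimal sketch of what template-based utterance generation for NLU training and evaluation might look like. The intents, templates, and slot values are hypothetical examples for illustration, not Bitext's or Deepgram's actual tooling.

```python
# Minimal sketch of template-based utterance generation for NLU training data.
# The intents, templates, and slot values below are hypothetical examples.
import itertools
import json
import random

TEMPLATES = {
    "check_balance": [
        "what's the balance on my {account}",
        "how much money is in my {account}",
        "can you tell me my {account} balance",
    ],
    "transfer_funds": [
        "move {amount} from my {account} to savings",
        "transfer {amount} out of my {account}",
    ],
}

SLOT_VALUES = {
    "account": ["checking account", "savings account", "credit card"],
    "amount": ["fifty dollars", "two hundred dollars", "$75"],
}

def expand(template: str) -> list[str]:
    """Fill every slot in a template with every combination of slot values."""
    slots = [name for name in SLOT_VALUES if "{" + name + "}" in template]
    if not slots:
        return [template]
    combos = itertools.product(*(SLOT_VALUES[s] for s in slots))
    return [template.format(**dict(zip(slots, combo))) for combo in combos]

def generate(seed: int = 0) -> list[dict]:
    random.seed(seed)
    rows = []
    for intent, templates in TEMPLATES.items():
        for template in templates:
            for text in expand(template):
                rows.append({"intent": intent, "text": text})
    random.shuffle(rows)  # shuffle before splitting into train/eval sets
    return rows

if __name__ == "__main__":
    data = generate()
    split = int(len(data) * 0.8)
    print(json.dumps({"train": data[:split], "eval": data[split:]}, indent=2))
```

The same template expansion could, in principle, be rerun with accent- or dialect-specific phrasings to cover the language varieties discussed above.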
Why is it harder to create voicebots vs. chatbots?
As Adam Sypniewski, CTO of Deepgram, noted, there is no plug-and-play with voicebots. You can't just unplug the chat interface, plug automatic speech recognition into the Conversational AI workflow, and expect it to work. Texting and chatting are essentially one-dimensional, while speech is multidimensional. Different tones of voice mean different things. Pauses in a conversation may or may not mean you are done speaking. Different words can all mean the same thing, like "Yes," "Yeah," "Sure," or "Uh-huh." Wait, was that "Uh-huh" non-committal or affirmative? And that is just English. What about the accents of people speaking English as a second language, or other languages with their own expressions? Simple transcripts from an automatic speech recognition (ASR) system will not pick up these differences, and they may not even get the words right. There is no easy button, but these cutting-edge companies are climbing that mountain.
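To illustrate the "Uh-huh" problem, here is a toy sketch of a confirmation classifier that needs more than the transcript. The prosody fields (pitch_rising, pause_before_s) are hypothetical cues invented for this example, not fields returned by any real ASR API.

```python
# Illustrative sketch only: a toy confirmation classifier showing why a bare
# ASR transcript is not enough. The prosody/timing cues are hypothetical.
from dataclasses import dataclass

AFFIRMATIVES = {"yes", "yeah", "sure", "yep", "uh-huh"}

@dataclass
class Utterance:
    transcript: str        # words from the ASR system
    pitch_rising: bool     # hypothetical prosody cue: rising pitch suggests hesitation
    pause_before_s: float  # hypothetical timing cue: long pause before answering

def classify_confirmation(utt: Utterance) -> str:
    """Return 'affirmative', 'noncommittal', or 'other' for a short reply."""
    word = utt.transcript.strip().lower()
    if word not in AFFIRMATIVES:
        return "other"
    # The text alone says "yes", but prosody can flip the meaning of "uh-huh".
    if word == "uh-huh" and (utt.pitch_rising or utt.pause_before_s > 1.0):
        return "noncommittal"
    return "affirmative"

# The same transcript, two different meanings once prosody is considered.
print(classify_confirmation(Utterance("uh-huh", pitch_rising=False, pause_before_s=0.2)))  # affirmative
print(classify_confirmation(Utterance("uh-huh", pitch_rising=True, pause_before_s=1.5)))   # noncommittal
```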
Light at the end of the tunnel
There is no easy button for a great general-purpose Conversational AI voicebot, but very good single-use-case voicebots are available now. Our experts all agree that this space will change dramatically in the next two to three years, moving toward that universal voicebot or personal assistant you can speak with like a friend.
Want to hear from the experts?
Our on-demand webinar with Bitext, OneReach.ai, and Deepgram discusses why customers are rejecting simple IVRs, chatbots, and menu-driven voicebots to embrace more human-like bots that understand the customer intent and respond correctly to meet customer needs. They also discuss what you need to consider in creating that great voicebot. View the on-demand webinar.
Questions from the audience
1. Any specific deployed examples of successful voice-based IVRs? Measuring the success of a voice-based IVR boils down to three things:
Adoption - Are people using it?
Satisfaction - Are you enhancing relationships with users?
Task completion - Are you providing value?
Successful IVRs have less to do with the specific tools being used, and more to do with the outcome that those tools provide. A leading consumer brand was retiring a fictitious marketing character, and through this campaign to retire it, they created an experience that allowed fans to call and talk to the character. Designing for the nuances of voice conversations is always challenging, but even more so when designing for a conversation with a fictitious character. The average call for this experience was around 7 minutes long, and this goes to show that when the design is well-thought-out, you can create experiences that are really engaging and fun.
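As a rough illustration of how those three metrics might be tracked, here is a hedged sketch that computes adoption, satisfaction, and task completion from call records. The CallRecord fields are a hypothetical log schema, not any particular product's.

```python
# Hedged sketch: computing the three IVR success metrics from call records.
# The CallRecord fields are a hypothetical log schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallRecord:
    caller_id: str
    used_voicebot: bool        # did the caller stay in the voice experience?
    csat_score: Optional[int]  # optional post-call survey score, 1-5
    task_completed: bool       # did the caller finish without a human agent?

def ivr_metrics(calls: list[CallRecord]) -> dict[str, float]:
    total = len(calls)
    adopted = [c for c in calls if c.used_voicebot]
    surveyed = [c for c in adopted if c.csat_score is not None]
    return {
        # Adoption: share of callers who actually use the voicebot
        "adoption": len(adopted) / total if total else 0.0,
        # Satisfaction: average survey score among voicebot users who responded
        "satisfaction": sum(c.csat_score for c in surveyed) / len(surveyed) if surveyed else 0.0,
        # Task completion: share of voicebot calls resolved without escalation
        "task_completion": sum(c.task_completed for c in adopted) / len(adopted) if adopted else 0.0,
    }

calls = [
    CallRecord("a", True, 5, True),
    CallRecord("b", True, None, False),
    CallRecord("c", False, None, False),
]
print(ivr_metrics(calls))  # e.g. {'adoption': 0.67, 'satisfaction': 5.0, 'task_completion': 0.5}
```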
2. Often, people have to struggle to understand second-language speakers (accents, grammar, misused idioms, etc.). How does NLU/NLP handle L2 speakers? The key is having training data that reflects the language specificities of L2 speakers. L2 is not harder to handle than L1; it is simply a different accent or variety of the language. Natural Language Generation (NLG), or automatic data generation, is the way to cover these and other language variants.
3. What are your thoughts on Spotify's new patent in emotional speech recognition to choose music for the user based on how they say things? Detecting sentiment, emotion, or speaker attributes like age is an interesting area of research, both for its technical merits and for the ethics surrounding training and using such features. It is a necessary field, though, because truly understanding intent in a conversational setting will require at least some understanding of emotional cues and some guesses about speaker attributes. Much of natural conversation can't be captured merely as words, so ignoring intonation, inflection, speaker context, and the emotion of what was said removes much of the meaning and intent. There are many challenges here that need to be solved. Training data alone raises big issues: one person's "rude" is another person's "assertive," and the same person may label the same example differently depending purely on their own current emotional state. Data like this is both hard to gather and hard to train with, and that is just the beginning of a pipeline that would deliver real value in a production setting. We are just at the beginning of these technologies. The next few years will see the first real trailblazing, successful implementations, and, more important than the applications they are part of, those first implementations will start to formalize the problems. That will allow follow-on work to bring this truly into the mainstream.
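One small, hypothetical illustration of the labeling problem above: rather than forcing a single "correct" emotion label, annotator disagreement can be kept as a soft label distribution. The labels and votes below are made up.

```python
# Illustrative sketch: turning disagreeing emotion annotations into soft labels
# rather than forcing one "correct" answer. Labels and votes are made up.
from collections import Counter

def soft_label(annotations: list[str]) -> dict[str, float]:
    """Convert one utterance's annotator votes into a probability distribution."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Three annotators heard the same clip and disagreed: one person's "rude"
# is another person's "assertive".
votes = ["rude", "assertive", "assertive"]
print(soft_label(votes))  # {'rude': 0.33..., 'assertive': 0.66...}
# A model trained against these soft targets keeps the ambiguity instead of
# learning a false certainty from a single forced label.
```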
4. Hi guys, in what you describe (especially Adam), isn't the way we architect conversational systems in separate silos (ASR->NLU->NLG->TTS) one of the main issues? Right now the problem is too big to bite off in one single system. Going from audio to action is just too big a problem, which means we have to break it into smaller, solvable pieces. Getting those pieces right at a high level is crucial to the success of a project. Luckily, we are seeing rapid improvements in every part of the pipeline and, as a consequence, are really starting to understand what needs to be passed between the pieces to handle the problem effectively. As these technologies evolve and mature, the roles of and connections between these pieces will likely shift and blur to give each one access to the information it needs. We are also likely to see some of these pieces merge or change function radically as the problem clarifies and the solutions become more capable. So yes, the current silos are a major part of the problem, but they are also enabling the rapid evolution that is driving improvement across the space.
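For readers who want to picture those silos, here is a minimal sketch of the ASR -> NLU -> NLG -> TTS hand-off. Every stage is a stub with made-up return values; the point is the shape of the data each silo passes to the next, including metadata such as confidence scores that richer pipelines would carry along.

```python
# Minimal sketch of the siloed ASR -> NLU -> NLG -> TTS pipeline discussed above.
# All stages are stubs with invented values; real engines would replace them.
from dataclasses import dataclass, field

@dataclass
class Transcript:
    text: str
    confidence: float                  # ASR confidence, useful downstream
    word_timings: list[tuple[str, float]] = field(default_factory=list)

@dataclass
class Intent:
    name: str
    slots: dict[str, str]
    confidence: float

def asr(audio: bytes) -> Transcript:
    # Placeholder: a real system would call a speech-to-text engine here.
    return Transcript(text="what's my checking balance", confidence=0.92)

def nlu(transcript: Transcript) -> Intent:
    # Placeholder intent classifier; real NLU could also use timings and prosody.
    return Intent(name="check_balance", slots={"account": "checking"},
                  confidence=transcript.confidence * 0.95)

def nlg(intent: Intent) -> str:
    # Placeholder response generation keyed off the resolved intent.
    return f"Your {intent.slots['account']} balance is $1,240."

def tts(text: str) -> bytes:
    # Placeholder: a real system would synthesize speech audio here.
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """Audio in, audio out: each silo only sees the previous silo's output."""
    return tts(nlg(nlu(asr(audio))))

print(handle_turn(b"\x00\x01"))
```

The interesting design question raised in the answer above is exactly which fields each of these interfaces should carry; as the pipeline matures, that contract is likely to grow richer or the stages may merge.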
5. Where can someone go to learn more about predictive analytics? The founder and chief technologist of OneReach.ai is writing a book called Age of Invisible Machines that explores how interfaces are disappearing and conversation is taking over. The book goes into some incredible detail regarding the big ideas, such as predictive analytics, but also explains how to think practically and apply academic research to the real world. In addition, many vendors in the space would be happy to set up an executive review to do a deep dive on how they think about specific topics. These will be the people who are really thinking about the topics and have important best practices to share, so I would recommend that as well.
6. Most chatbots today are task-oriented. How can one make bots that are able to work with or detect things like soft skills, so they feel more human? The current answer may be the integration of different voicebots: one for handling content (the classical bot we know now) and one for handling soft skills (tone, politeness, use of colloquial language, and so on). Will we get to a universal conversational voicebot that can speak on many topics and handle all soft skills? Someday, but we may need more optimization in hardware processing in addition to optimization of the overall Conversational AI pipeline.
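As a toy sketch of that two-bot idea (our own illustration, not a vendor's implementation), a content bot could produce the factual reply while a separate soft-skills layer adjusts the tone before it reaches the user.

```python
# Toy sketch of the two-bot idea above: a content bot produces the factual reply,
# and a separate soft-skills layer adjusts tone. Both "bots" are stubs with
# hypothetical, hard-coded behavior.

def content_bot(user_text: str) -> str:
    # Classical task-oriented bot: here a stub that returns a canned answer.
    return "Your order ships Tuesday."

def soft_skills_bot(reply: str, user_sentiment: str) -> str:
    # Adjusts register and politeness based on a detected sentiment cue.
    if user_sentiment == "frustrated":
        return "I'm sorry for the wait. " + reply + " Is there anything else I can do?"
    return reply + " Glad I could help!"

def respond(user_text: str, user_sentiment: str) -> str:
    return soft_skills_bot(content_bot(user_text), user_sentiment)

print(respond("where is my order", "frustrated"))
print(respond("where is my order", "neutral"))
```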
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.