Read This Before Converting Your Chatbot To A Voicebot

There Are Telling Differences Between Text and Voice Interfaces

Cobus Greyling


Conversational Components

Firstly, we need to think of any conversation, be it in voice or text, as being presented to the user by means of conversational components.

Conversational Components are the different elements (or affordances if you like) available within a specific medium.

For instance, within Facebook Messenger there are buttons, carousels, menus, quick replies and the like.

Facebook Messenger with Feature Rich Conversational Components

These allow the developer to leverage conversational components to the maximum, hence negating any notion of a pure natural language conversational interface.

It is merely a graphic interface living within a conversational medium. Or, if you like, a graphic user interface made available within a conversational container.

Is this wrong? No.

It takes the interface to the user’s medium of choice, hence making it more accessible and removing the friction apps present.

The problem with such an approach is that you are heavily dependent on the affordances of the medium.

Should you want to roll your chatbot out on WhatsApp, the same conversational components will obviously not be available, and you will have to lean more on NLU, keyword spotting or a menu-driven solution. With an even more basic medium like SMS, you are really dependent on NLU or a number/keyword-driven menu.

WhatsApp With Far Less Conversational Components

Where the menu is constituted by keywords or numbers, the user needs to input a keyword or number to navigate the UI.
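A keyword/number-driven menu like the one described above can be sketched as follows. This is a minimal, hypothetical illustration (the menu entries and function names are assumptions, not any platform’s API): the menu is rendered as plain text, the only "component" SMS affords, and the user navigates by replying with a digit or a keyword.

```python
# Hypothetical number/keyword-driven menu for a low-affordance medium
# such as SMS or WhatsApp, where buttons and quick replies do not exist.

MENU = {
    "1": ("balance", "Your account balance"),
    "2": ("transfer", "Transfer money"),
    "3": ("agent", "Speak to an agent"),
}

def prompt() -> str:
    """Render the menu as plain text, the only 'component' SMS affords."""
    lines = ["Reply with a number or keyword:"]
    for digit, (keyword, label) in MENU.items():
        lines.append(f"{digit}. {label} ({keyword})")
    return "\n".join(lines)

def route(user_input: str) -> str:
    """Map a digit or keyword onto an intent; anything else falls back."""
    text = user_input.strip().lower()
    for digit, (keyword, _) in MENU.items():
        if text in (digit, keyword):
            return keyword
    return "fallback"
```

Anything outside the menu vocabulary drops to a fallback, which is exactly where the NLU or keyword spotting mentioned above would take over.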

Is it a chatbot?

Technically, yes; as it lives in a multi-threaded asynchronous conversational medium.

But is it conversational in nature? One would have to argue no.

So in the second example a rudimentary approach needs to be followed for one single reason: the medium does not afford the rich functionality and is primarily text-based.

With voice, these components disappear completely. With only the natural language component available within the Voice User Interface (VUI) environment, there are no components to leverage; the affordances are invisible.

Invisible Affordances

Voice User Interfaces allow users to interact with an interface via voice or speech. Digital assistants (Siri, Google Home, Alexa) afford the user the freedom to use speech without touch or any graphic interface.

Voice User Interface Transitioning To Text

Often, where there are no visual affordances, no implicit guidance can be given to the user. Due to the nascent nature of voice interfaces, it is virtually impossible to leverage existing experience or knowledge of the user.

Due to the strong association of conversations with humans and not machines, the expectations of the user are often much higher than the capability of the interface. Hence in most cases the user’s input exceeds the design of the interface.

There is a certain level of patience and perseverance required from the user to build affinity with, and sympathy for, the interface.

Voice Access Device

Research has shown that users opt for speech input while driving, or at home. Wherever the user feels the need to have a private conversation, they opt for text. This is usually at work, public transport, in general waiting areas or in public.

Amazon Echo Alexa Integration To The Mercedes-Benz Car API

Users will expect to have a seamless experience between the different mediums of voice and text.

An element which should not be underestimated is the advance in microphone technology in devices like the Amazon Echo and Google Home. This allows accurate ASR to be performed on the voice input prior to intent and meaning discovery. Some companies are looking at having a conversational experience over a telephone call, which is not really feasible. A mobile app will perform much better, but still not as well as a dedicated device.

Voice Is Ephemeral

Conversations which live within a text chat environment like WhatsApp, Messenger etc. afford the user the opportunity to refer back to the conversation history. This allows text conversations to be multi-threaded and asynchronous. While users are chatting with your chatbot, they are simultaneously chatting to other people or bots.

Amazon Echo Speech Interface

A voice conversation is not multi-threaded from a user’s perspective. The user can only have one conversation at a time. But more importantly, it is synchronous. The conversation is usually started, conducted and ended in real time. The duration of voice conversations is normally also shorter.

By ephemeral I mean that the smart assistant dialog evaporates; hence studies show that consumers are more likely to transact on text bots than voice bots. While it is true that users can refer to a companion app for the conversation transcripts, this is not done during the conversation.

Often a chatbot can dispatch a number of speech bubbles, each containing quite a bit of content. This will not work for a voice bot. For voice, the speech from the voice bot will have to be a single utterance (one speech bubble, if you like), and the text must be shortened.
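The bubble-collapsing step above can be sketched as follows. This is an illustrative sketch only; the character budget is an assumption, not a platform limit, and the cut point simply falls back to the last full sentence that fits.

```python
# Collapse a multi-bubble chatbot reply into one shortened utterance
# suitable for a voice channel. MAX_VOICE_CHARS is an assumed budget.

MAX_VOICE_CHARS = 240

def bubbles_to_utterance(bubbles: list[str]) -> str:
    """Join chat bubbles into one utterance, trimmed to a speakable length."""
    joined = " ".join(b.strip() for b in bubbles if b.strip())
    if len(joined) <= MAX_VOICE_CHARS:
        return joined
    # Cut at the last sentence boundary that fits, rather than mid-word.
    cut = joined[:MAX_VOICE_CHARS]
    last_stop = cut.rfind(". ")
    return cut[: last_stop + 1] if last_stop > 0 else cut
```

In practice the shortened utterance would be rewritten by a copywriter rather than truncated mechanically, but the principle stands: one turn, one utterance.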

Some elements like URLs, phone numbers, addresses etc. will not suffice when spoken aloud, and in such cases a user needs to be referred to a companion app. Or the information can be texted, for instance.
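For one concrete case, a phone number can at least be made speakable by spacing it out digit by digit, so a TTS engine reads it as individual digits rather than as a large number. This is a hypothetical helper, not any platform’s API:

```python
# Rewrite a phone number so TTS reads it digit by digit
# ("0 8 2 ..." instead of "eighty-two ...").

def speakable_phone(number: str) -> str:
    """Keep only digits and separate them with spaces for natural TTS."""
    return " ".join(ch for ch in number if ch.isdigit())
```

URLs and street addresses usually cannot be rescued this way, which is why the referral to a companion app or a follow-up text remains the better option.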

Amazon Echo With Different Languages

Any graphic display presented by the text bot will have to be removed for the voice experience. Should the voice bot have a display (like Google Home Hub or Echo Show), the message will have to be adapted for the medium.

Also keep in mind that users cannot really refer back to previous conversations. So as with a text bot it might make sense to preserve the state…In the case of voice, it will most probably be more helpful if the voice bot starts a new conversation from scratch.

Radio Silence

API integration touch points should have sub-second response times for a voice interface. Silence during a conversation is disconcerting from a user’s perspective and often leads to a belief that the conversation has been terminated.
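One common way to guard against this silence is to run the backend lookup against a short deadline and, if the deadline is missed, speak a filler line while the lookup completes. The sketch below illustrates the idea with hypothetical function names and assumed (shortened) timings:

```python
# Guard a backend lookup behind a time budget so the voice channel
# never goes silent: on timeout, speak a filler, then the real answer.

import asyncio

FILLER = "One moment while I look that up."

async def lookup_balance() -> str:
    """Stand-in for a backend API call; the sleep simulates a slow API."""
    await asyncio.sleep(0.3)
    return "Your balance is 100 dollars."

async def respond(budget_s: float = 0.05) -> str:
    """Answer within the budget, or cover the silence with a filler."""
    task = asyncio.ensure_future(lookup_balance())
    try:
        # shield() keeps the lookup running even if the deadline expires.
        return await asyncio.wait_for(asyncio.shield(task), timeout=budget_s)
    except asyncio.TimeoutError:
        return FILLER + " " + await task
```

This is the voice equivalent of the IVR practice of playing hold audio during a lookup, mentioned below.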

Multi-Modal Experience with Amazon Echo Show

This is analogous to an IVR (Interactive Voice Response) system fronting a call center, where any on-call silence in the IVR application during a lookup is highly undesirable.

During a text chat conversation, the user can be entertained by “is typing” displays, as with a human. And we are used to waiting for the human on the other side to respond in text after receiving our message. Hence there is an expected delay.

Hence a slight delay in the delivery of dialog is not that critical in text-based conversations; in the case of voice, however, a delay is very counterproductive.


There is this allure that a text-based bot (chatbot) can merely be converted into an API and be utilized by voice interfaces, be it IVR, Amazon Echo, Google Assistant and the like.

But in practice this is not the case. As stated at the outset of this story, you will always have a mediation layer. This mediation layer will sit between your Chatbot API (NLU engine) and the presentation layer comprised of the mediums.

The chatbot API output will be received by the mediation layer and rendered in a format digestible by the medium it is intended for. The medium being Facebook Messenger, WhatsApp, Text/SMS, web chat etc.
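The mediation layer described above can be sketched as a single rendering function: one canonical reply from the NLU engine goes in, and a medium-specific payload comes out. The reply shape and medium names here are assumptions for illustration, not any vendor’s schema:

```python
# Hypothetical mediation layer: render one canonical chatbot reply
# into the format each medium can actually digest.

CANONICAL_REPLY = {
    "text": "Do you want to check your balance?",
    "options": ["Yes", "No"],
}

def render(reply: dict, medium: str) -> dict:
    """Translate a canonical NLU-engine reply into the medium's format."""
    if medium == "messenger":
        # Rich medium: options become tappable quick replies.
        return {
            "text": reply["text"],
            "quick_replies": [{"title": o, "payload": o.lower()}
                              for o in reply["options"]],
        }
    if medium == "sms":
        # Plain text only: flatten the options into the message body.
        options = " / ".join(reply["options"])
        return {"text": f"{reply['text']} Reply {options}."}
    if medium == "voice":
        # One spoken utterance; options are read out, nothing is shown.
        options = " or ".join(reply["options"])
        return {"speech": f"{reply['text']} Say {options}."}
    raise ValueError(f"unknown medium: {medium}")
```

Note how the voice branch must verbalize the options, which is exactly the extra work the next paragraph refers to.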

For voice interfaces, more work will be required to ensure the text to be spoken back to the user is suitable and consumable in an audio format.




Cobus Greyling

I explore and write about all things at the intersection of AI & language; LLMs/NLP/NLU, Chat/Voicebots, CCAI.