Designing Differently For Voicebots Versus Chatbots
…and Why You Cannot Just Voice-Enable Your Chatbot
Introduction
Digital Assistants usually first manifested via text mediums…
Each chatbot saw the light of day on a specific text medium: Messenger, SMS, WhatsApp and so on. That choice is often dictated by the geographic region, the demographic’s preferred chat platform, or simply what the customer base is already using.
From this initial text medium of choice, the chatbot sprawls into other text-based mediums.
As this digital employee and assistant grows in functionality, intelligence and text-based mediums, the natural next step is voice, giving users the option to access the virtual agent by speaking to it.
Voice interfaces can be made available via a voice-first device like Amazon Echo (Alexa) or Google Home, or even a traditional telephone call.
Just as design considerations were paramount moving from one text medium to another, definite design decisions are necessary for sprawling into voice.
Basic Enabling Elements
There are three basic elements required to take a chatbot to a voicebot.
Gateway
The first is a gateway, which can be a telephone network or an integration with a digital assistant interface.
If you are familiar with the Alexa Developer Console these terms will be familiar to you.
Speech To Text (ASR)
The second is ASR (Automatic Speech Recognition), also known as Speech-To-Text (STT).
This is a very mature technology and widely used. For example, Google’s voice search.
Essentially, it turns speech from an audio format into verbatim text.
The challenge with STT/ASR is that it is language dependent. It is therefore astute design to perform language detection upfront and advise users if you cannot accommodate them. If you find yourself having to support a niche language, a standard commercial cloud solution will not suffice.
But don’t despair, there are alternatives.
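To make the language dependency concrete, here is a minimal sketch of checking the detected language before transcribing. The functions detect_language and transcribe are placeholders for whichever speech provider you integrate with, and the supported-language list is purely illustrative.

```python
# A minimal sketch: detect the language upfront and advise the user if it
# cannot be accommodated. detect_language() and transcribe() stand in for
# whichever speech provider you actually use.

SUPPORTED_LANGUAGES = {"en-US", "de-DE"}

def detect_language(audio: bytes) -> str:
    """Placeholder: call your provider's language-identification endpoint here."""
    return "en-US"

def transcribe(audio: bytes, language: str) -> str:
    """Placeholder: call your provider's speech-to-text endpoint here."""
    return "transcribed utterance"

def handle_audio(audio: bytes) -> str:
    language = detect_language(audio)
    if language not in SUPPORTED_LANGUAGES:
        # Advise the user upfront rather than mis-transcribing their speech.
        return "Sorry, I cannot help you in that language yet."
    return transcribe(audio, language=language)
```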
Speech Synthesis
Thirdly, you need to convert text into speech, known as Text-To-Speech (TTS) or speech synthesis. Here language again plays a pivotal role.
There are products available that allow you to create a custom voice from a voice artist’s recording of a few pages of text. It is important that the voice is natural sounding and, if possible, unique to your organization.
Being able to enrich the synthetic speech with SSML (Speech Synthesis Markup Language) is a must, in order to add emotion, intonation, pauses and the like.
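As an illustration, the snippet below wraps a spoken response in a few common SSML tags for pauses, emphasis and prosody. Exact tag support varies between TTS vendors, so treat it as a sketch rather than a vendor-specific reference.

```python
# Illustrative SSML: pauses, emphasis and prosody added to a spoken response.
# Tag support differs per TTS vendor; the wording is an example only.
ssml = """
<speak>
  Your order has been <emphasis level="moderate">confirmed</emphasis>.
  <break time="400ms"/>
  <prosody rate="95%" pitch="+2st">It should arrive on Thursday.</prosody>
  <break time="300ms"/>
  Is there anything else I can help you with?
</speak>
"""
```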
Invisible Affordances
Voice User Interfaces allow users to interact with an interface via voice or speech. Digital assistants (Siri, Google Home, Alexa) afford the user the freedom to use speech without touch or any graphical interface.
Often, where there are no visual affordances, no implicit guidance can be given to the user. Due to the nascent nature of voice interfaces, it is virtually impossible to leverage existing experience or knowledge of the user.
Because conversations are strongly associated with humans rather than machines, the expectations of the user are often much higher than the capability of the interface. Hence, in most cases, what the user says exceeds what the interface was designed to handle.
A certain level of patience and perseverance is required from the user to build affinity and sympathy with the interface.
Voice Access Device
Research has shown that users opt for speech input while driving or at home. Wherever the user feels the need for a private conversation, they opt for text; this is usually at work, on public transport, in waiting areas or in public generally.
Users will expect to have a seamless experience between the different mediums of voice and text.
An element which should not be underestimated is the advance in microphone technology in devices like the Amazon Echo and Google Home. This allows for accurate ASR to be performed on the voice prior to intent and meaning discovery. Some companies are looking at delivering a conversational experience over a telephone call, which is not really feasible; a mobile app will perform much better, but still not as well as a dedicated device.
Voice Is Ephemeral
Conversations which live within a text chat environment like WhatsApp, Messenger and the like afford the user the opportunity to refer back to the conversation history. This allows text conversations to be multi-threaded and asynchronous: while users are chatting with your chatbot, they are simultaneously chatting with other people or bots.
A voice conversation is not multi-threaded from a user’s perspective; the user can only have one conversation at a time. More importantly, it is synchronous: the conversation is started, conducted and ended in real time. The duration of voice conversations is normally also shorter.
By ephemeral I mean that the smart assistant dialog evaporates; hence studies show that consumers are more likely to transact with text bots than voice bots. While it is true that users can refer to a companion app for the conversation transcript, this is not done during the conversation.
Often a chatbot can dispatch a number of speech bubbles, each containing quite a bit of content. This will not work for a voicebot. For voice, the bot’s response has to be a single utterance, one speech bubble if you like, and the text must be shortened.
Some elements like URLs, phone numbers and addresses do not translate well to speech; in such cases the user needs to be referred to a companion app, or the information can be texted to them, for instance.
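As a rough illustration of these two points, the sketch below collapses a multi-bubble chatbot response into one shortened utterance and replaces URLs with a spoken referral. The bubble content, length limit and fallback wording are all assumptions for the example.

```python
# A sketch of flattening a multi-bubble chatbot response into one short voice
# utterance. The length limit and referral wording are illustrative.
import re

URL_PATTERN = re.compile(r"https?://\S+")

def to_voice_utterance(bubbles: list[str], max_chars: int = 200) -> str:
    # Collapse the separate speech bubbles into a single utterance.
    text = " ".join(bubbles)
    # URLs do not work in spoken audio; refer the user elsewhere instead.
    text = URL_PATTERN.sub("the link in our companion app", text)
    # Keep the spoken response short; cut back to a sentence boundary if possible.
    if len(text) > max_chars:
        text = text[:max_chars].rsplit(". ", 1)[0] + "."
    return text

print(to_voice_utterance([
    "Thanks! Your booking is confirmed.",
    "You can view the details at https://example.com/bookings/123.",
]))
```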
Any graphic display presented by the text bot will have to be removed for the voice experience. Should the voice bot have a display (like Google Home Hub or Echo Show), the message will have to be adapted for the medium.
Also keep in mind that users cannot really refer back to previous conversations. So while with a text bot it might make sense to preserve state, in the case of voice it will most probably be more helpful if the voicebot starts a new conversation from scratch.
Radio Silence
API integration touch points should have sub-second response times for a voice interface. Silence during a conversation is disconcerting from the user’s perspective and often leads to the belief that the conversation has been terminated.
This is analogous to an IVR (Interactive Voice Response) system fronting a call center, where any on-call silence during a lookup is highly undesirable.
During a text chat conversation the user can be entertained by an “is typing” indicator, as with a human. We are also used to waiting for the person on the other side to respond after receiving our message, so there is an expected delay.
Hence a slight delay in the delivery of the dialog or text is not that critical in text-based conversations; in the case of voice, however, a delay is very counterproductive.
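One common way to bridge unavoidable back-end latency is to play a short holding prompt when a lookup exceeds a threshold. The sketch below shows the idea; the lookup, threshold and prompt wording are illustrative assumptions, not a prescribed pattern.

```python
# A sketch of bridging "radio silence": if a backend lookup takes longer than
# a threshold, speak a short holding prompt so the user knows the conversation
# is still alive. The lookup and prompts are illustrative.
import concurrent.futures
import time

HOLD_PROMPT = "One moment while I look that up for you."

def slow_account_lookup() -> str:
    time.sleep(2)  # stand-in for a backend/API call
    return "Your balance is 42 dollars."

def respond_with_filler(lookup, threshold_seconds: float = 0.8) -> list[str]:
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(lookup)
        try:
            return [future.result(timeout=threshold_seconds)]
        except concurrent.futures.TimeoutError:
            # Speak the holding prompt first, then the real answer when it arrives.
            return [HOLD_PROMPT, future.result()]

for utterance in respond_with_filler(slow_account_lookup):
    print(utterance)
```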
Conclusion
There is an allure that a text-based bot (chatbot) can merely be exposed as an API and consumed by voice interfaces, be it an IVR, Amazon Echo, Google Assistant and the like.
But in practice this is not the case. As stated at the outset of this story, you will always have a mediation layer. This mediation layer sits between your chatbot API (NLU engine) and the presentation layer comprised of the various mediums.
The chatbot API output will be received by the mediation layer and rendered in a format digestible by the medium it is intended for, the medium being Facebook Messenger, WhatsApp, Text/SMS, web chat and so on.
For voice interfaces, more work will be required to ensure the text to be spoken back to the user is suitable and consumable in an audio format.
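A highly simplified mediation layer might look something like the sketch below, where a channel-agnostic chatbot response is rendered differently per medium. The response structure and channel names are assumptions for illustration, not any particular platform’s API.

```python
# A minimal sketch of a mediation layer: the chatbot API (NLU engine) returns
# a channel-agnostic response, and this layer renders it per medium.

def render(response: dict, medium: str) -> dict:
    options = response.get("options", [])
    if medium == "messenger":
        # Rich medium: keep the options as structured quick replies.
        return {"text": response["text"], "quick_replies": options}
    if medium == "sms":
        # Plain text: fold the options into a numbered keyword menu.
        menu = " ".join(f"{i}. {o}" for i, o in enumerate(options, 1))
        return {"text": f"{response['text']} {menu}".strip()}
    if medium == "voice":
        # Audio: one short utterance, with the options spoken in the sentence.
        spoken = response["text"]
        if options:
            spoken += " You can say " + ", or ".join(options) + "."
        return {"ssml": f"<speak>{spoken}</speak>"}
    raise ValueError(f"Unknown medium: {medium}")

bot_response = {"text": "Which account would you like to check?",
                "options": ["savings", "current"]}
print(render(bot_response, "voice"))
```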
Conversational Components
We need to think of any conversation, be it in voice or text, as being presented to the user by means of conversational components.
Conversational Components are the different elements (or affordances if you like) available within a specific medium.
For instance, within Facebook Messenger there are buttons, carousels, menus, quick replies and the like.
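For example, to the best of my knowledge the Messenger Send API accepts quick replies along these lines, expressed here as a Python dict; the recipient ID, titles and payload values are placeholders.

```python
# Illustrative Messenger Send API payload using quick replies.
# The recipient id, titles and payload values are placeholders.
quick_reply_message = {
    "recipient": {"id": "<PSID>"},
    "messaging_type": "RESPONSE",
    "message": {
        "text": "How would you like to pay?",
        "quick_replies": [
            {"content_type": "text", "title": "Card", "payload": "PAY_CARD"},
            {"content_type": "text", "title": "Voucher", "payload": "PAY_VOUCHER"},
        ],
    },
}
```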
The developer can leverage these conversational components to the maximum, which negates any notion of a pure natural language conversational interface.
It is merely a graphic interface living within a conversational medium. Or, if you like, a graphic user interface made available within a conversational container.
Is this wrong…no.
It is taking the interface to the user’s medium of choice. Hence making it more accessible and removing the friction apps present.
The problem with such an approach is that you are heavily dependent on the affordances of the medium.
Should you want to roll your chatbot out on WhatsApp, the same conversational components will obviously not be available, and you will have to lean more on NLU, keyword spotting or a menu-driven solution. With an even more basic medium like SMS you are really dependent on NLU or a number/keyword-driven menu.
Where the menu is constituted by keywords or numbers, the user has to input that keyword or number to navigate the UI.
Is it a chatbot?
Technically, yes; as it lives in a multi-threaded asynchronous conversational medium.
But is it conversational in nature? One would have to argue no.
So in the second example a rudimentary approach needs to be followed for one simple reason: the medium does not afford rich functionality and is primarily text based.
With voice, these components disappear completely. Only the natural language component is available within the Voice User Interface (VUI) environment; there are no components to leverage and the affordances are invisible.