
Lessons I Learnt From Launching A Voicebot

I wanted to write a definitive guide based on personal experience of launching a voicebot. In this article I cover launching a voicebot, converting a chatbot to a voicebot, defining and measuring success, Word Error Rate, and more.

Cobus Greyling
7 min read · Aug 15, 2022



Voicebots are much harder to implement than chatbots, for a few reasons.

The first is that voice is synchronous: no delays or silence should exist during the customer interaction. Silence is often caused by API or backend lookup latency, or by a voicebot that is not resilient in handling exceptions when the user goes quiet or an application error occurs.
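One common way to avoid dead air during a slow backend lookup is to play a filler prompt once latency crosses a threshold. A minimal sketch using asyncio; the prompt text, threshold and lookup function are all hypothetical stand-ins, not a vendor API:

```python
import asyncio

FILLER_PROMPT = "One moment while I look that up."  # hypothetical prompt
MAX_SILENCE_S = 2.0  # play a filler if the backend is slower than this

async def backend_lookup(query: str) -> str:
    """Stand-in for a real API call; simulates a slow backend."""
    await asyncio.sleep(3.0)
    return f"result for {query!r}"

async def speak(text: str) -> None:
    """Stand-in for the TTS playback channel."""
    print(f"BOT: {text}")

async def answer_with_filler(query: str) -> str:
    """Never leave the caller in silence: if the lookup exceeds
    MAX_SILENCE_S, play a filler prompt and keep waiting."""
    task = asyncio.create_task(backend_lookup(query))
    try:
        # shield() keeps the lookup running even if the wait times out
        return await asyncio.wait_for(asyncio.shield(task), MAX_SILENCE_S)
    except asyncio.TimeoutError:
        await speak(FILLER_PROMPT)  # cover the silence
        return await task           # then deliver the real answer

result = asyncio.run(answer_with_filler("balance"))
```

Here the two-second timeout fires first, the filler is spoken, and the caller still receives the lookup result once it arrives.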

The second consideration is the absence of face speed, the subtle cues we glean from each other's faces during conversation.

A third consideration is that design or conversational affordances are invisible, unlike a chatbot, where the user is presented with marquees, buttons, menus and the like.

A voicebot works better with an open input approach, rather than a voice-enabled IVR menu or keyword approach. The more verbose the user input (within reason, of course), the higher the level of accuracy; single-word input is hard in terms of assigning intents.

And lastly, we as humans do manage to have fairly successful conversations over a phone call. These conversations have predominantly four phases…

The first is pleasantries: how are you, how's the family, thanks for taking my call, and so on.

The second is establishing intent by one of the two parties.

The third is the business end of the conversation. Here we as humans naturally manage dialog turns, often agreeing on "who goes first". Barge-in also takes place, where a speaker either accepts a barge-in or insists on first completing their dialog turn. When there is silence in a conversation, we have ways of soliciting a response and testing the connection.

These elements are the hardest to recreate in a conversation, and this is where most voicebots fall short.

Fourthly, we close off the conversation and mutually agree to end the call.
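The third-phase mechanics of soliciting a response and testing the connection can be sketched as a simple silence-escalation policy. The prompts and the two-strike threshold below are hypothetical:

```python
# Hypothetical silence-handling policy mirroring how humans first
# solicit a response, then test the connection, then close the call.
REPROMPTS = [
    "Sorry, I didn't catch that. Could you repeat?",  # solicit a response
    "Are you still there?",                           # test the connection
]

def handle_silence(silence_count: int) -> tuple[str, bool]:
    """Return (prompt, should_hang_up) for the nth consecutive silence."""
    if silence_count < len(REPROMPTS):
        return REPROMPTS[silence_count], False
    return "It seems we got disconnected. Goodbye.", True
```

So `handle_silence(0)` re-prompts, `handle_silence(1)` checks the line, and `handle_silence(2)` closes the call gracefully instead of leaving the user hanging.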

Going From A Chatbot To A Voicebot

A while back I wrote this article on considerations when extending a chatbot to a voicebot. Specific design considerations come into play with regard to voice.

I often use the schema above when describing a voicebot, and in theory a voicebot is a mere extension of a text-based chatbot. The allure is to treat the process as simply adding TTS and STT to the chatbot, and a voicebot comes into existence.

These speech components are:

1️⃣ Text to Speech (TTS, Speech Synthesis). This is the process of converting text into a synthesised, audible voice. This voice is played back to the user, hence it constitutes the portion of the architecture where the voicebot speaks to the user.

2️⃣ Speech To Text (STT, ASR, Automatic Speech Recognition) is the process of converting speech audio into text.

This text is in turn passed to the NLU engine. STT providers do supply a base language pack which performs quite well, but again, even with a limited amount of training data, great strides in ASR accuracy are possible.
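In schematic form the turn loop is STT → NLU → dialog → TTS. A minimal sketch with stand-in components; every function body here is a placeholder for illustration, not a real vendor API:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    transcript: str   # STT output
    intent: str       # NLU output
    reply: str        # text handed to TTS

def stt(audio: bytes) -> str:
    """Stand-in for ASR; a real system transcribes the audio."""
    return "what is my balance"

def nlu(text: str) -> str:
    """Stand-in intent classifier."""
    return "check_balance" if "balance" in text else "none"

def dialog(intent: str) -> str:
    """Stand-in dialog manager mapping intents to responses."""
    return {"check_balance": "Let me check that for you."}.get(intent, "Sorry?")

def tts(text: str) -> bytes:
    """Stand-in speech synthesis; a real system returns audio."""
    return text.encode()

def voicebot_turn(audio: bytes) -> Turn:
    transcript = stt(audio)
    intent = nlu(transcript)
    reply = dialog(intent)
    tts(reply)  # played back to the caller
    return Turn(transcript, intent, reply)

turn = voicebot_turn(b"")
```

The point of the sketch is the shape of the pipeline: each stage's output is the next stage's input, so errors at the STT stage propagate into the NLU and dialog stages.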

Read more on TTS and STT considerations here…⬇️

Measuring Success

Obviously the objective of launching any product is success, yet in Conversational AI there are many failed projects where the reporting was misleading.

Below is a link to a detailed article I wrote on measuring chatbot and voicebot success.

But in summary, success can be broken down into three parameters:

  1. User Experience
  2. User Interface
  3. System Metrics

User Experience

User Experience is often measured by CSAT and NPS scores. This is not wrong, and it serves as an indicator of users' sentiment or feeling after they have made use of the user interface. But these measures are subjective from a voicebot perspective for two reasons…

The first is that they measure how well the user was assisted with their particular query or problem. Bad ratings in this department are often related to the wider enterprise with regard to the product; the user's overall bad experience surfaces at this touchpoint. A way to improve this is by collaborating with product owners across the enterprise, UX experts for other customer-facing mediums, product support, etc. An organisation-wide improvement of products and services will have a positive impact on these measurements.

Secondly, the measurement is often gamed by teams in order to gather more positive feedback, for example by surveying only users with a propensity to rate their experience positively.

User Interface

The user interface can be measured on a few fronts. For instance, one measurement is containment: how many customers exit the automated voicebot for a customer care consultant.

Containment needs to also be measured across customer contact mediums. A good measurement is to see how many customers made contact via the voicebot, but then subsequently made use of another avenue to have a query resolved within 10 days or less.
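As a sketch, cross-medium containment over a 10-day window could be computed from a unified contact log. The log rows and channel names below are invented for illustration:

```python
from datetime import date, timedelta

# Hypothetical contact log: (customer_id, channel, contact_date)
contacts = [
    ("c1", "voicebot", date(2022, 8, 1)),
    ("c1", "call_centre", date(2022, 8, 5)),   # re-contacted within 10 days
    ("c2", "voicebot", date(2022, 8, 1)),      # contained
    ("c3", "voicebot", date(2022, 8, 1)),
    ("c3", "call_centre", date(2022, 8, 20)),  # outside the 10-day window
]

def true_containment(log, window_days: int = 10) -> float:
    """Share of voicebot contacts NOT followed by a contact on another
    channel within the window: containment across contact mediums."""
    bot_calls = [(c, d) for c, ch, d in log if ch == "voicebot"]
    escaped = 0
    for cust, day in bot_calls:
        if any(c == cust and ch != "voicebot"
               and day <= d <= day + timedelta(days=window_days)
               for c, ch, d in log):
            escaped += 1
    return 1 - escaped / len(bot_calls)

rate = true_containment(contacts)  # 2 of 3 voicebot contacts contained
```

In this toy log only customer c1 escapes to the call centre within the window, so true containment is two out of three, even if the voicebot alone would have reported all three sessions as "contained".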

Other elements which need to be part of the UI are disambiguation, digression, handling of silence, background noise, etc.

System Metrics

System Metrics include measurements like Word Error Rate (WER), the percentage of missed or unrecognised intents, etc.

Word Error Rate

The accuracy of STT is measured in Word Error Rate (WER), which calculates how accurately speech is transcribed into text. The diagram below explains the basic principle of calculating WER.

One can argue that a good STT base model can yield low insertion and deletion scores. Where most base models fall short is substitutions; fine-tuning of the base model is necessary to improve the substitution score.
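The basic principle, WER = (substitutions + deletions + insertions) / reference word count, can be computed with a word-level edit distance. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / N,
    via word-level Levenshtein distance against the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

score = wer("please check my account balance",
            "please check my count balance")  # one substitution in five words
```

One substitution ("account" heard as "count") over a five-word reference gives a WER of 20%, which illustrates how quickly short utterances inflate the score.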

The fine-tuning data can consist of:

  • Text examples of utterances, or
  • Recordings with human-transcribed text.

This helps the STT system recognise industry-specific jargon in the case of banking, telecoms or any other industry-specific implementation.

Vendors lay claim to WER scores of 5% and lower. As I have stated before, in my experience a good WER for non-native speakers of a language is in the vicinity of 15% to 18%.

In order to have a representative WER score, the following steps need to be considered…

  • The total set of recordings used as a baseline to establish the WER score needs to be a minimum of 30 minutes in length to be truly representative.
  • Ensure the recordings used are representative of the environment the user will most probably use the voicebot in, in terms of background noise and general setting.
  • Recordings used for WER benchmarking must also be captured via the same medium the voicebot makes use of. If the voicebot will be accessed via a telephone call, use phone recordings for the WER baseline recording set.
  • WER baseline recordings must be representative of the user base in terms of age, gender, ethnicity, regional accent, etc.
  • Industry and regional specific words must be used in a representative fashion, for instance place names, first and last names, names of products and services and industry specific technologies and terms. There needs to be a representation of speech aberrations like hesitation, self-correction, filler words, etc.
  • Obviously the elements mentioned here, which must be present in the WER benchmarking recordings, should also be part of the STT fine-tuning (acoustic) model training recordings.
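Putting the first point into practice: pool the errors across the whole benchmark set rather than averaging per-file scores, and refuse to report a score on too little audio. The benchmark figures below are invented for illustration, and the per-recording error counts are assumed to come from a word-level alignment done beforehand:

```python
# Hypothetical benchmark results, one row per recording:
# (duration_seconds, reference_word_count, word_errors)
benchmark = [
    (90, 30, 4),
    (120, 45, 8),
    (1700, 500, 70),
]

def pooled_wer(rows, min_total_s: float = 30 * 60) -> float:
    """Pool errors over the whole set (never average per-file WERs,
    which over-weights short recordings) and require at least
    30 minutes of audio for a representative score."""
    total_s = sum(d for d, _, _ in rows)
    if total_s < min_total_s:
        raise ValueError(f"only {total_s / 60:.1f} min of audio; "
                         "need at least 30 min for a representative WER")
    words = sum(w for _, w, _ in rows)
    errors = sum(e for _, _, e in rows)
    return errors / words

score = pooled_wer(benchmark)
```

On this toy set the pooled score works out to roughly 14%, in the region I would expect for a reasonably tuned model, whereas averaging the three per-file WERs would give a different, misleading number.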

Read more here.


The voice squad needs to be truly agile. There should be daily transcript reviews attended by the whole squad; joint transcript reviews build a collective, shared understanding within the squad of how the voicebot is faring.

Transcript reviews also give insight into the true customer experience, literally word by word. Quick, small iterative improvements are key; within sprints, ample capacity must be allocated to maintenance and optimisation.


