Photo by Davide Cantelli on Unsplash

Measuring Chatbot & Voicebot Success

And Why The Metrics Need To Keep Each Other In Check

Cobus Greyling
7 min read · Jun 8, 2022



Most Conversational AI frameworks used to create digital assistants (chatbots and voicebots) have an augmented management console with dials and charts on how the chatbot is performing.

Often this makes for an impressive sales slide or presentation to executives, but it does not always accurately reflect the true quality of the conversational experience.

Basic Architecture Extension of a Chatbot to a Voicebot

The diagram above shows the basic architecture extension of a chatbot to a voicebot. There are, however, a few design considerations.

Generally speaking the main objectives of the digital assistant are to:

  1. Automate customer experience
  2. Save money and present a compelling return on investment (ROI)
  3. Improve customer experience

One of the reasons telephone / call centre automation and CCAI are taking off is that enterprises such as banks and telcos do not want to introduce another channel or add another device to their customer experience landscape. They want to automate their existing channels and mediums, of which voice calls constitute the bulk and are the most expensive to handle.


Conversational designs need to be translated into good user experiences.

User experience is how the user feels after using the interface. That is why intent-driven development (iDD) is important for accurately judging what users want to speak about.

The intents users want and the intents that are designed and developed need to overlap 100%. A process of intent-driven development can help align what’s developed with what the user wants.

A good source of customer intents is existing user conversations, especially the conversations from the call centre. Research has shown that a single reason for calling customer care can account for 40% to 60% of call centre calls. If these calls and their subsequent variations can be automated, the system would prove its worth.

An astute tool is required to automate the categorisation of user utterances into intents, sub-intents and possibly intents nested a few levels deep.

Having a list of user utterances is a good start and extremely valuable. But segmenting these utterances into intents whilst taking into consideration…

  1. Cluster Size
  2. Granularity
  3. Language
  4. Intent Overlaps
  5. Categorisation for sub, sub-sub, etc. intents…

demands an automated tool with the dials and levers to achieve the right results in intent configuration.
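As a rough illustration of what such a tool does at its core, the sketch below groups utterances by word overlap (Jaccard similarity) in a single greedy pass. It is a toy under stated assumptions, not how any particular product works: the function names, the stop-word list and the `threshold` value are all hypothetical, and real tools typically rely on sentence embeddings and proper clustering algorithms. The `threshold` parameter plays the role of a granularity dial.

```python
# Toy intent discovery: greedy grouping of utterances by word overlap.
# All names and the default threshold are illustrative assumptions.

STOP_WORDS = {"i", "my", "to", "the", "a", "is", "do", "how", "can", "please", "want"}


def tokens(utterance: str) -> frozenset[str]:
    """Lower-case the utterance, strip basic punctuation and drop stop words."""
    words = utterance.lower().replace("?", "").replace(".", "").split()
    return frozenset(w for w in words if w not in STOP_WORDS)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_utterances(utterances: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Assign each utterance to the first cluster whose seed is similar enough.

    A lower threshold gives fewer, larger clusters (coarse intents);
    a higher threshold gives more, smaller clusters (fine-grained intents).
    """
    clusters: list[tuple[frozenset[str], list[str]]] = []
    for utterance in utterances:
        toks = tokens(utterance)
        for seed, members in clusters:
            if jaccard(seed, toks) >= threshold:
                members.append(utterance)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append((toks, [utterance]))
    return [members for _, members in clusters]


sample = [
    "I want to check my balance",
    "check balance please",
    "How do I reset my password?",
    "reset password",
    "what is my account balance",
]
groups = cluster_utterances(sample)
for group in groups:
    print(group)
```

With the default threshold, "what is my account balance" ends up in a cluster of its own. Lowering the threshold to 0.25 merges it into the balance cluster, which is the kind of granularity trade-off the list above refers to.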

Fun Fact

At a large enterprise we took 20,000 recordings. These recordings were meticulously transcribed by humans.

The same 20,000 recordings were passed through an Automatic Speech Recognition (ASR) API, with a custom trained acoustic model.

Both sets of transcripts (ASR and human transcribed) were run against the NLU API, and only a 4% degradation in intent recognition was found on the ASR transcribed data.

This points to the general accuracy and maturity of ASR, but it needs to be kept in mind that the same recordings were used. The aim of this exercise was to validate and test the ASR concept.

However, it also needs to be kept in mind that users interact with a chatbot (a text-based conversational interface) very differently from how they interact with a voicebot. In a voicebot, user utterances are more verbose, with longer inputs and none of the single-word inputs that quick replies and buttons produce in a chatbot.
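A comparison of this kind can be sketched as follows. The `classify` callable stands in for the NLU API and `toy_classify` is a hypothetical keyword lookup used only to make the example run. The article does not say whether the 4% figure is an absolute or a relative drop, so the sketch reports both.

```python
from typing import Callable


def intent_accuracy(transcripts: list[str], expected: list[str],
                    classify: Callable[[str], str]) -> float:
    """Fraction of transcripts for which the NLU returns the expected intent."""
    hits = sum(1 for text, intent in zip(transcripts, expected) if classify(text) == intent)
    return hits / len(transcripts)


def asr_degradation(human_acc: float, asr_acc: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = (human_acc - asr_acc) * 100
    relative = (human_acc - asr_acc) / human_acc * 100
    return absolute, relative


def toy_classify(text: str) -> str:
    """Stand-in for a real NLU call (assumption, for illustration only)."""
    return "balance" if "balance" in text.lower() else "other"


# The same recordings, transcribed twice: once by humans, once by ASR.
expected = ["balance", "balance", "other", "balance"]
human = ["what is my balance", "check balance", "hello there", "balance please"]
asr = ["what is my balance", "check ballots", "hello there", "balance please"]

human_acc = intent_accuracy(human, expected, toy_classify)
asr_acc = intent_accuracy(asr, expected, toy_classify)
absolute_drop, relative_drop = asr_degradation(human_acc, asr_acc)
print(f"human: {human_acc:.2f}, ASR: {asr_acc:.2f}, drop: {absolute_drop:.1f} points")
```

The toy data is deliberately tiny and exaggerated. The point is the shape of the test: identical recordings, two transcription routes, one NLU model, and a single number for the difference.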


Too many metrics create confusion, and in many cases the metrics focus on behaviour that will complement the design of the journey. Elements of journeys completed become important. The danger is always there to default to volume and containment, and if those two indicators are acceptable, seemingly all is well.

But here is what is important when measuring a digital assistant’s success:

  1. What did the customer experience?
  2. Was their intent resolved?
  3. To what extent was the bot pro-active in building contextual awareness and resolving secondary or root-cause issues?

Four metrics (NPS, CSAT, Containment and Returning Customers) keep each other in check. High containment places pressure on the other three metrics, while improving NPS and CSAT places pressure on Containment. Reducing Returning Customers takes time, as it requires building better journeys and applying the learnings from NPS feedback.

These four metrics must be used in conjunction, as they are interdependent.

The ideal state for the four metrics is:

  • Net Promoter Score needs to be high.
  • Customer Satisfaction Score needs to be high.
  • Containment needs to be high.
  • Returning Customers needs to be low.

Balancing these four (high NPS, CSAT and Containment, and low Returning Customers) on the same channel takes time and effort to get right.
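One way to read the four metrics together is to flag combinations where one looks healthy at the expense of another. The sketch below does that. The target values in `TARGETS` are hypothetical placeholders, not recommendations, and the function name and warning texts are assumptions made for illustration.

```python
# Hypothetical targets; every organisation will set its own.
TARGETS = {"nps": 30.0, "csat": 80.0, "containment": 60.0, "returning": 15.0}


def diagnose(nps: float, csat: float, containment: float, returning: float) -> list[str]:
    """Flag combinations where one metric looks good at the expense of another.

    nps is on the -100..100 scale; the other three are percentages.
    """
    ratings_ok = nps >= TARGETS["nps"] and csat >= TARGETS["csat"]
    contained = containment >= TARGETS["containment"]
    returning_ok = returning <= TARGETS["returning"]

    warnings = []
    if contained and not ratings_ok:
        warnings.append("High containment with poor ratings: users may be blocked from agents.")
    if ratings_ok and not contained:
        warnings.append("Good ratings with low containment: the bot may just be a route to an agent.")
    if contained and not returning_ok:
        warnings.append("High containment but many returning customers: queries are not being resolved.")
    if not contained and not ratings_ok and not returning_ok:
        warnings.append("All four metrics are off target: the bot is not working.")
    return warnings


# Containment looks excellent, but the ratings tell a different story.
for warning in diagnose(nps=-10, csat=50, containment=90, returning=10):
    print(warning)
```

An empty list means no single metric is being propped up by the others, which is the balance described above.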

Net Promoter Score (NPS)

NPS can be a very harsh metric. It is often representative of the overall product and service level of an organisation, and hence not always representative of a particular channel.

NPS Explained

Touchpoint NPS (tNPS) can be a way of segmenting NPS ratings according to the channel and service. This can be an indication of where opportunities for improvement lie. When Containment is down and users are allowed priority access to agents, then NPS and CSAT are bound to be higher.

If containment is low and no care is given to returning customers, then NPS and CSAT are usually high. But this means the bot is not used and merely acts as an avenue to get to an agent.
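For reference, NPS is conventionally computed from a 0 to 10 "how likely are you to recommend us" rating as the percentage of promoters (9 or 10) minus the percentage of detractors (0 to 6). The sketch below applies that standard formula per channel to approximate tNPS. The channel names and survey data are made up for illustration.

```python
from collections import defaultdict


def nps(scores: list[int]) -> float:
    """Net Promoter Score on the standard 0-10 scale.

    Promoters score 9-10, detractors 0-6; passives (7-8) only count
    towards the total. The result ranges from -100 to 100.
    """
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def touchpoint_nps(responses: list[tuple[str, int]]) -> dict[str, float]:
    """Group (channel, score) survey responses by channel and score each group."""
    by_channel: dict[str, list[int]] = defaultdict(list)
    for channel, score in responses:
        by_channel[channel].append(score)
    return {channel: nps(scores) for channel, scores in by_channel.items()}


# Hypothetical survey responses, tagged with the channel that triggered them.
responses = [
    ("voicebot", 10), ("voicebot", 9), ("voicebot", 3), ("voicebot", 8),
    ("chatbot", 6), ("chatbot", 7), ("chatbot", 10), ("chatbot", 2),
]
per_channel = touchpoint_nps(responses)
print(per_channel)
```

Because detractors span seven of the eleven possible ratings and promoters only two, a handful of middling scores pulls the number down quickly, which is part of why the metric feels harsh.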

Customer Satisfaction Score

CSAT is often used to measure a particular channel or medium, or a particular leg within a chatbot, and is used in parallel with NPS. Where NPS can be organisation-wide, tNPS attempts to measure a particular medium or channel, and CSAT measures a leg or journey within the agent.

CSAT can be implemented to remedy a particular portion of the user experience, but it is also susceptible to being artificially inflated by dropping containment.


Containment

Containment is basically the percentage of users not opting for an agent transfer or call-back. This can be “improved” by simply not giving the user an option to go to an agent or schedule a call-back.

This will lead to good containment figures, but NPS and CSAT will be rock-bottom. The way to improve containment is to create an exceptional conversational experience and resolve the user’s query. Merely blocking users from an agent not only drives down NPS and CSAT, but also drives up the number of customers returning.

Containment can be inflated artificially by not allowing users through to an agent, but then the other three metrics will suffer.
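Following the definition above, containment can be computed from session logs as in this sketch. The session fields `agent_transfer` and `callback_requested` are hypothetical names; real platforms will expose this information differently.

```python
def containment_rate(sessions: list[dict]) -> float:
    """Percentage of sessions that ended without an agent transfer or call-back."""
    if not sessions:
        raise ValueError("no sessions")
    contained = sum(
        1 for s in sessions
        if not s.get("agent_transfer", False) and not s.get("callback_requested", False)
    )
    return 100 * contained / len(sessions)


# Hypothetical session log: one transfer and one call-back out of five sessions.
sessions = [
    {"id": 1},
    {"id": 2, "agent_transfer": True},
    {"id": 3},
    {"id": 4, "callback_requested": True},
    {"id": 5},
]
rate = containment_rate(sessions)
print(f"Containment: {rate:.0f}%")
```

Note that the calculation cannot tell a resolved query from a user who gave up because no agent option was offered, which is exactly why it needs the other three metrics alongside it.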

Returning Customers

The Returning Customers metric measures whether a customer who made use of the digital assistant then returned, or made use of any other channel, within a period of seven to ten days.
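A minimal sketch of this measurement is shown below, assuming contact records from all channels can be joined on a customer identifier. The record layout, channel names and the default seven-day window are illustrative assumptions.

```python
from datetime import datetime, timedelta


def returning_customer_rate(bot_sessions, all_contacts, window_days=7):
    """Percentage of bot sessions followed by another contact from the same
    customer, on any channel, within `window_days`.

    bot_sessions: list of (customer_id, timestamp)
    all_contacts: list of (customer_id, channel, timestamp) across all channels
    """
    if not bot_sessions:
        raise ValueError("no bot sessions")
    window = timedelta(days=window_days)
    returned = 0
    for customer_id, bot_time in bot_sessions:
        # Strictly later than the bot session, so the session itself never counts.
        if any(
            cid == customer_id and bot_time < when <= bot_time + window
            for cid, _channel, when in all_contacts
        ):
            returned += 1
    return 100 * returned / len(bot_sessions)


bot_sessions = [
    ("c1", datetime(2022, 6, 1, 9, 0)),
    ("c2", datetime(2022, 6, 1, 10, 0)),
    ("c3", datetime(2022, 6, 2, 11, 0)),
    ("c4", datetime(2022, 6, 2, 12, 0)),
]
all_contacts = [
    ("c1", "call_centre", datetime(2022, 6, 3, 14, 0)),  # 2 days later
    ("c2", "email", datetime(2022, 6, 20, 8, 0)),        # 19 days later
    ("c3", "chatbot", datetime(2022, 6, 5, 16, 0)),      # 3 days later
]
rate = returning_customer_rate(bot_sessions, all_contacts)
print(f"Returning customers: {rate:.0f}%")
```

The sketch counts any follow-up contact, whether or not it concerns the same query, so it inherits the harshness of the metric itself.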

Users might make use of the chatbot, but then opt to make use of a different channel. For instance, in the case of a poor chatbot, users might opt to call the call centre, go to a walk-in centre, or use email, the app, etc. This means the user was not helped via the digital agent, as assumed, but opted to make use of another channel.

A high number of returning customers adds pressure on other channels, does not deliver the automation as promised, and shows customers are voting with their feet.

The worst-case scenario is one where returning customers are high, containment is low and ratings are bad. This is a sure sign the bot is not working.

This metric is also rather harsh, as the two visits might not be related. But the challenge here is for the conversational agent to build contextual awareness and resolve queries in a pro-active manner.


The advantage of NPS is that it creates an avenue to reach out to customers, have a conversation and understand the reason for their rating. In many instances the negativity from the user is not directly related to the chatbot or the voicebot, but is rather product, service and organisation related.

Addressing this takes a collective effort, and the conversational agent becomes a mechanism, or at least a major contributor, to drive up enterprise-wide NPS and CSAT.


