How Important Is Measuring Word Error Rate (WER) For Voicebots?

The recent open-sourcing of OpenAI Whisper, which approaches human-level robustness and accuracy on English speech recognition, has shed new light on the WER measurement.

Cobus Greyling
5 min read · Oct 6, 2022


🗣 Voicebot = Conversation Design + NLU Design + ASR Design

1️⃣ Following the market, it is clear that there is significant focus on #voicebots and on automating customer calls to the contact centre…

2️⃣ Conversation Design has always been prioritised, and for good reason. Recently I highlighted the importance of NLU Design…

3️⃣ But what about ASR Design? And how can the quality of ASR be measured?

﹡First, the basics: the terms Speech Recognition, Automatic Speech Recognition (ASR) and Speech To Text (STT) are used interchangeably.

Is measuring the Word Error Rate (WER) of your Speech Recognition System important?

Yes! You should be measuring the WER!

And here are six reasons why…

1️⃣ A speech implementation of a Digital Assistant will in all likelihood have a separate NLU model. The reason for this is the unique nature of a speech interface, where users are more verbose. Implementing a directed dialog approach is harder with speech than with chat/text, because speech interfaces have invisible (non-graphic) design affordances.

This contrasts with chat interfaces, which have varying degrees of graphic design affordances depending on the medium of choice: SMS, WhatsApp, Messenger, etc.

2️⃣ The NLU model cannot absorb all the variations coming through via the ASR interface. Attempting to do this increases complexity in the NLU Design process.

3️⃣ Most ASR solutions have an option to fine-tune via text-based training examples, training audio (an acoustic model), or both.

Fine-tuning the ASR addresses challenges like speaker age, gender, ethnicity, accents, and the medium of access (for instance, voice via a telephone call).
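
For the text-based route, most vendors let you bias recognition towards in-domain phrases. Below is a minimal sketch using Google Cloud Speech-to-Text's speech adaptation as one vendor example; the phrases, boost value, bucket URI and language code are placeholders, and other ASR platforms expose similar (but differently named) options.

```python
# Sketch: bias an ASR model towards industry-specific vocabulary via
# phrase hints (Google Cloud Speech-to-Text speech adaptation).
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,  # typical telephony audio
    language_code="en-ZA",   # placeholder locale
    speech_contexts=[
        # Boost in-domain terms so "SIM swap" is not mis-recognised.
        speech.SpeechContext(phrases=["SIM swap", "prepaid"], boost=15.0)
    ],
)

audio = speech.RecognitionAudio(uri="gs://my-bucket/contact-centre-call.wav")
response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(result.alternatives[0].transcript)
```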

4️⃣ Measuring WER captures three key error types in transcribing speech:

◾️ [I] Insertions — words added by the ASR

◾️ [D] Deletions — words not detected by the ASR

◾️ [S] Substitutions — words which are incorrectly transcribed and replaced by other words. Substitutions are the biggest hurdle to overcome in creating a voicebot; they usually arise when industry-specific words are not recognised and another word is substituted.

For instance, in the mobile industry the term “SIM Swap” is common, but a default ASR model (sans any fine-tuning) might transcribe it as “same swipe”… and this is only one example.
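
Putting the pieces together: WER = (S + D + I) / N, where N is the number of words in the human reference transcript. As a minimal illustration (my own sketch, not tied to any particular ASR toolkit), the word-level edit distance below yields the WER for the “SIM Swap” example:

```python
# Minimal WER sketch: align the hypothesis against the reference with
# dynamic programming (word-level Levenshtein distance). The distance is
# the minimum number of substitutions, deletions and insertions needed.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # match, no edit
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j - 1],  # substitution
                    d[i - 1][j],      # deletion
                    d[i][j - 1],      # insertion
                )
    return d[len(ref)][len(hyp)] / len(ref)

# 2 substitutions over 6 reference words → WER ≈ 0.33
print(wer("i would like to sim swap", "i would like to same swipe"))
```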


5️⃣ [N] The WER can easily be gamed, producing an artificially flattering score…

Let me explain… the base reference [N] is a set of recordings which have been transcribed by humans. This set is passed through the fine-tuned ASR model, and the output is compared to the human transcriptions to calculate [I], [D] and [S].

So here is the thing: the complexity, context and content of the recordings in [N] need to be representative of what users will actually say. Examples of this are place names, people’s names, product and service names, slang, etc.
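
As an example of scoring such a set, the sketch below assumes the open-source jiwer package; the transcripts are invented for illustration, and in practice the hypotheses would come from passing the recordings through the fine-tuned ASR model.

```python
import jiwer  # pip install jiwer

# Human transcriptions [N]: representative, in-domain utterances.
references = [
    "i want to do a sim swap on my prepaid number",
    "please port my number to the new network",
]

# What the (here: imagined) fine-tuned ASR model returned for the same audio.
hypotheses = [
    "i want to do a same swipe on my prepaid number",
    "please port my number to the new network",
]

# Aggregate WER across the whole set: total edits / total reference words.
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")  # 2 / 19 ≈ 10.53%
```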

6️⃣ The WER method is not perfect, but it is a benchmark which can be used as a reference as acoustic models and ASR text training data are updated. Any degradation in NLU performance can then be traced back to a new ASR acoustic model which negatively impacted ASR accuracy.

Final Thoughts

The Achilles heel of WER is that it penalises any difference between the model output and the reference human transcript [N]. This means that output transcripts which a human would judge as correct (and interpret correctly) will be marked as incorrect by WER due to minor formatting differences.
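
A common mitigation (again a sketch assuming the jiwer package) is to normalise both the reference and the ASR output before scoring, so casing and punctuation differences are not counted as errors:

```python
import jiwer  # pip install jiwer

# Normalise casing, punctuation and whitespace before scoring.
transform = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.ReduceToListOfListOfWords(),
])

reference = "Please do a SIM swap."
hypothesis = "please do a sim swap"

# Transforms are passed positionally; the keyword argument names
# changed between jiwer releases.
raw = jiwer.wer(reference, hypothesis)
normalised = jiwer.wer(reference, hypothesis, transform, transform)
print(raw, normalised)  # 0.6 vs 0.0: raw WER counts three "errors" here
```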

But! In the case of a voicebot, the ASR output will not be interpreted by human inspection but by an NLU model… hence the importance of accuracy.

Please follow me on LinkedIn for the latest updates on Conversational AI. 🙂

I’m currently the Chief Evangelist @ HumanFirst. I explore and write about all things at the intersection of AI and language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces and more.

https://www.linkedin.com/in/cobusgreyling/

Read the Whisper Paper here.
