How Important Is Measuring Word Error Rate (WER) For Voicebots?
The recent open-sourcing of OpenAI Whisper, which approaches human-level robustness and accuracy on English speech recognition, has shed new light on the WER measurement.
🗣 Voicebot = Conversation Design + NLU Design + ASR Design
1️⃣ Following the market, it is clear that there is significant focus on #voicebots and on automating customer calls to the contact centre…
2️⃣ Conversation Design has always been prioritised, and for good reason. Recently I highlighted the importance of NLU Design…
3️⃣ But what about ASR Design? And how can the quality of ASR be measured?
﹡First, the basics: the terms Speech Recognition, Automatic Speech Recognition (ASR) and Speech To Text (STT) are used interchangeably.
Is measuring the Word Error Rate (WER) of your Speech Recognition System important?
Yes! You should be measuring the WER!
And here are six reasons why…
1️⃣ A speech implementation of a Digital Assistant will in all likelihood have a separate NLU model. The reason for this is the unique nature of a speech interface, where users are more verbose. Implementing a directed dialog approach is harder with speech than with chat/text, because speech interfaces have invisible (non-graphic) design affordances.
This is in contrast to chat interfaces, which have varying degrees of graphic design affordances depending on the medium of choice: SMS, WhatsApp, Messenger, etc.
2️⃣ The NLU model cannot absorb all the variations coming through via the ASR interface. Attempting to do this increases complexity in the NLU Design process.
3️⃣ Most ASR solutions have an option to fine-tune via text-based training examples, training audio (for the acoustic model), or both.
Fine-tuning the ASR addresses challenges like speaker age, gender, ethnicity, accents, and the medium of access (for instance, voice via a telephone call).
4️⃣ Measuring WER quantifies three error types in transcribing speech:
◾️[I] Insertions — words added by the ASR
◾️[D] Deletions — words not detected by the ASR
◾️[S] Substitutions — words which are incorrectly transcribed, with another word substituted in their place. Substitutions are the biggest hurdle to overcome in creating a voicebot; they usually come about when industry-specific words are not recognised.
For instance, in the mobile industry the term “SIM Swap” is common, but a default ASR model (sans any fine-tuning) might transcribe it as “same swipe”…and this is only one example. A minimal WER computation over this exact phrase is sketched below.
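To make the arithmetic concrete, here is a minimal sketch of word-level WER in Python, built on a standard edit-distance table: WER = (S + D + I) / N, where N is the number of words in the reference transcript. The “SIM swap” phrases are the hypothetical example from above.

```python
def wer_counts(reference: str, hypothesis: str):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum word edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Walk back through the table to classify each edit as S, D or I.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]   # [S] incorrectly transcribed word
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1                          # [D] word the ASR missed
            i -= 1
        else:
            ins += 1                           # [I] word the ASR added
            j -= 1
    return subs, dels, ins, (subs + dels + ins) / len(ref)

s, d, i, wer = wer_counts("please do a sim swap", "please do a same swipe")
print(f"S={s} D={d} I={i} WER={wer:.0%}")   # S=2 D=0 I=0 WER=40%
```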
5️⃣ [N] The WER can easily be gamed, making accuracy look artificially better than it is…
Let me explain…the base reference [N] is a set of recordings which are human-transcribed. This set is passed through the fine-tuned ASR model and the output is compared to the human transcriptions to calculate [I], [D] and [S].
So here is the thing: the complexity, context and content of the recordings in [N] need to be representative of what users will actually say. Examples of this are place names, people’s names, product and service names, slang, etc., as in the hypothetical test set sketched below.
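As a sketch of that evaluation loop: the file names, transcripts and the transcribe() stub below are all hypothetical stand-ins for your own data and ASR API, and wer_counts() is the function from the earlier sketch.

```python
def transcribe(audio_file: str) -> str:
    """Hypothetical stand-in for your fine-tuned ASR model's API."""
    # Canned outputs so the sketch runs end to end.
    return {
        "call_001.wav": "i want to do a same swipe",   # "SIM swap" mis-heard
        "call_002.wav": "my port from vodacom failed",
        "call_003.wav": "upgrade my contract to the iphone fourteen",
    }[audio_file]

# Human-transcribed reference set [N]: domain terms, provider names, slang...
test_set = [
    ("call_001.wav", "i want to do a sim swap"),
    ("call_002.wav", "my port from vodacom failed"),
    ("call_003.wav", "upgrade my contract to the iphone fourteen"),
]

total_errors, total_words = 0, 0
for audio_file, human_transcript in test_set:
    s, d, i, _ = wer_counts(human_transcript, transcribe(audio_file))
    total_errors += s + d + i
    total_words += len(human_transcript.split())

print(f"Corpus WER: {total_errors / total_words:.2%}")
```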
6️⃣ The WER method is not perfect, but it is a benchmark which can be used as a reference as acoustic models and ASR text training data are updated. Any degradation in NLU performance can then be traced back to a new ASR acoustic model which negatively impacted ASR accuracy.
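A sketch of what that benchmark loop could look like in practice, with hypothetical model names and stored scores:

```python
# Re-score the same reference set [N] after every acoustic model or
# training-data update, and flag degradation before it silently hurts
# the NLU model downstream. Values below are hypothetical.
baseline = {"model": "asr-v1", "wer": 0.12}   # last accepted run on [N]
candidate_wer = 0.17                          # corpus WER of "asr-v2" on [N]

if candidate_wer > baseline["wer"]:
    print(f"Regression: WER rose from {baseline['wer']:.0%} "
          f"to {candidate_wer:.0%}; investigate the new acoustic model.")
else:
    print("No ASR regression detected; safe to promote the new model.")
```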
Final Thoughts
The Achilles heel of WER is that it penalises any difference between the model output and the reference human transcript [N]. This means that output transcripts which would still be judged correct via human inspection (and interpreted correctly) are marked as incorrect by WER due to minor formatting differences.
But! In the case of a voicebot, the ASR output will not be interpreted by human inspection…but by an NLU model…hence the importance of accuracy.
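One common mitigation is to normalise both the reference and the ASR output before scoring, so WER measures real errors rather than formatting. A minimal sketch, assuming lowercasing and punctuation stripping are safe for your NLU model’s input:

```python
import re

def normalise(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # strip punctuation
    return " ".join(text.split())         # collapse whitespace

reference = "Please do a SIM swap."
hypothesis = "please do a sim swap"
print(normalise(reference) == normalise(hypothesis))   # True: zero WER after normalising
```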
Please follow me on LinkedIn for the latest updates on Conversational AI. 🙂
I’m currently the Chief Evangelist @ HumanFirst. I explore and write about all things at the intersection of AI and language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces and more.
Read the Whisper Paper here.