Photo by Jason Chen on Unsplash

Voicebots & The Importance Of Face Speed

Challenges In Deploying and Managing Speech Interfaces

Cobus Greyling
4 min read · Jun 24, 2022


Introduction

I have written extensively on voicebots from a number of perspectives.

However, voicebots or speech interfaces are especially difficult to get right due to the synchronous nature of the interaction. Add to this the moving parts of Automatic Speech Recognition (ASR), managing Word Error Rate (WER), and speech synthesis.
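As a brief illustration of the WER metric mentioned above, here is a minimal sketch of how WER is typically computed: the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. The example transcripts are my own.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            insertion = d[i][j - 1] + 1
            deletion = d[i - 1][j] + 1
            d[i][j] = min(substitution, insertion, deletion)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("please call the contact centre",
          "please call a contact center"))
# Two substitutions over five reference words -> 0.4
```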

All these elements are discussed in detail in the articles listed in the footer.

In this story I would like to discuss the concept of Face Speed, and consider instances where Face Speed is not possible, such as a phone call.

Living Services

Few reports have had such a profound impact on my thought process as the 2015 report from Fjord Design & Innovation titled The Era of Living Services.

It opened my mind to the idea of user interfaces as a living service, one that continuously adapts to the user.

Think of services as ambient, existing within the user’s environment and orchestrated based on user movement and behaviour, all the while surfacing the right data, at the right time, via the right medium or interface.

Speech and chat are just two of a number of user interfaces. Other interfaces or input methods include gestures, facial expressions, user routines and behaviours, and so on.

All of these elements together constitute an ever-changing and adapting, ambient, orchestrated service or interface.

Conversational AI Skills

For example, the multimodal aspect of NVIDIA Riva is best understood in this context, considering Riva’s available user interfaces:

  • ASR (Automatic Speech Recognition)
  • TTS (Text To Speech)
  • NLU (Natural Language Understanding)

And more specifically…

  • Gesture Recognition
  • Lip Activity Detection
  • Object Detection
  • Gaze Detection
  • Sentiment Detection

The ultimate and truly humanlike speech interface.
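To make this concrete, the sketch below imagines how signals like these could be fused into a single turn-taking decision at face speed. This is purely illustrative: the signal names, thresholds and fusion logic are my own assumptions, not NVIDIA Riva API calls.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultimodalSignals:
    # Hypothetical per-frame outputs of detectors like those listed above.
    lip_activity: bool       # is the user's mouth moving?
    gaze_on_device: bool     # is the user looking at the interface?
    gesture: Optional[str]   # e.g. "hand_raised", "nod", or None
    sentiment: float         # -1.0 (negative) to 1.0 (positive);
                             # could modulate tone rather than timing

def should_bot_speak(s: MultimodalSignals) -> bool:
    """Naive fusion of visual cues into a turn-taking decision."""
    if s.lip_activity:
        return False  # the user is still talking; do not barge in
    if s.gesture == "hand_raised":
        return False  # the user is signalling they want the floor
    # Respond only when the user has gone quiet and is looking at the device.
    return s.gaze_on_device

signals = MultimodalSignals(lip_activity=False, gaze_on_device=True,
                            gesture=None, sentiment=0.3)
print(should_bot_speak(signals))  # True: the user has yielded the turn
```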

Face Speed

We are used to facial expressions being part of our conversations. We intuitively read each other’s faces while we have a conversation.

As we make interfaces more humanlike, users will expect them to be synchronous and instantaneous. Users will be less tolerant of delays and of a computer that is thinking.

We as users expect conversational interfaces to respond at face speed.

For example, designers are anthropomorphising user interfaces, making them more human-like and conversation-like. By implication, however, the user then expects the interface to have human-like face-speed characteristics.

There are two challenges here. The first is that the user is presented with a simple and natural interface, where they feel very much at home. The interface is simplified by removing complexity, but that complexity needs to be accommodated and hosted somewhere else. In the case of conversational systems, it is hosted behind the user interface: taking complexity away from the user means adding complexity to conversational design and development.

The second challenge is that with a GUI, the more user affordances are added graphically, the worse the interface gets. Thus, with a GUI, less is more when it comes to design.

On the contrary, with a voice or conversational interface, more complexity is better, because conversation design affordances are invisible from the user’s perspective.

Lastly…

Face Speed has two components. The first is the speed at which data is delivered. The second is the conversational affordances of face speed: detecting who is speaking, and reading gestures, expressions and more.

NVIDIA Riva aims to solve for this, but for voicebots reached via a phone call it will remain a problem. Turn-taking and barge-in are two of the biggest challenges at this stage.

The answer might lie not in trying to make the conversation too natural, but in having a cue: something that serves as a signal or suggestion for turn-taking.
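One common form such a cue takes in telephony voicebots is an end-of-utterance timeout: once the caller has produced no speech energy for a fixed window, the bot takes the turn. Below is a minimal sketch of that idea over a stream of per-frame voice-activity flags; the 20 ms frame size and 800 ms threshold are illustrative assumptions, not values from Riva or any specific platform.

```python
END_OF_TURN_MS = 800   # silence needed before the bot takes the turn (assumed)
FRAME_MS = 20          # duration of each audio frame (assumed)

def find_turn_boundary(voiced_frames):
    """Return the frame index at which the bot may take the turn,
    i.e. the end of the first run of silence >= END_OF_TURN_MS,
    or None if the caller never yields."""
    needed = END_OF_TURN_MS // FRAME_MS
    silent = 0
    for i, voiced in enumerate(voiced_frames):
        silent = 0 if voiced else silent + 1
        if silent >= needed:
            return i
    return None

# 1 = voice activity detected in the frame, 0 = silence.
stream = [1] * 50 + [0] * 45   # one second of speech, then 900 ms of silence
print(find_turn_boundary(stream))  # 89: the bot may respond at this frame
```

A real system would pair this with barge-in handling: if the caller starts speaking while the bot is talking, playback stops and the silence timer resets.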
