
Getting Started With IBM Voice Agent With Watson

Here Are A Few Examples On How To Program & Customize Your VoiceBot



Introduction

Once your VoiceBot is up and running, and you can place a telephone call to it, you will want to program it.

Here I look at a few more advanced features of the IBM Voice Agent, also referred to as the Voice Gateway. The voice agent makes use of the following components:

  • Speech To Text (Converting User Speech Into Text)
  • Watson Assistant (Conversational AI)
  • Text To Speech (Converting Bot Text Into Voice)
Watson Service Orchestration. Source

The Voice Agent allows you to voice-enable your chatbot. The heart of the conversational experience is Watson Assistant.

IBM Multilingual Virtual Voice Agent Demo

Watson Assistant hosts the following four conversational elements:

  • Intents
  • Entities
  • Dialogs
  • State Management

The other two components constituting the VoiceBot are Speech To Text (STT) and inversely, Text To Speech (TTS).

The Text to Speech portion is the easier part to tweak, as there is a host of languages and voices available to choose from.

IBM Voice Agent With Watson

For some languages, various accents, personas and more are also available.

SSML (Speech Synthesis Markup Language) enables the tweaking of the voice.

To name only a few of the voice elements which can be set: pitch, contour, pitch range, rate, duration and volume.
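
As a minimal sketch of how this could look in a dialog node, assuming SSML tags in the response text are passed through to the Text to Speech service unchanged, the response can wrap its text in a prosody tag; the wording and attribute values here are my own illustration:

{
  "output": {
    "text": {
      "values": [
        "<speak version=\"1.0\"><prosody rate=\"-10%\" pitch=\"+15%\">Welcome, you are now listening to a slightly slower, higher pitched voice.</prosody></speak>"
      ],
      "selection_policy": "sequential"
    }
  }
}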

A text-based chatbot cannot merely be converted to a voicebot, or speech-enabled assistant. I wrote a Medium story on the key considerations and challenges when looking to augment a text-based assistant with voice capabilities.

It is not practical to list all the functions or commands in this story; only a few crucial and interesting commands are listed.

It must be noted that all the examples below are actual examples I have running in my IBM Cloud instance.

For a complete list of commands take a look here.

Placing Your First Call

You will need a twilio account and an IBM Cloud account to create your first VoiceBot. On twilio, select the Elastic SIP Trunking option to add your Watson SIP address and reserve a number.

Within Voice Agent with Watson there is a four-step process to launch your voice bot into the wild.

When you create a voice agent, Watson Voice Agent automatically searches for any available Watson service instances you can reuse. This is a very useful feature and can save you cost.

Certain plans also allow only a certain number of instances, and reusing existing instances of TTS, STT, etc. can save resources.

twilio Elastic SIP Trunking

If no service instance is available, you can create one along with the voice agent or connect to services in a different IBM Cloud account.

It is also possible to use other cloud elements, like a Google Speech to Text or Google Text to Speech instance.

On your dashboard, go to the Voice agents tab and click Create a Voice Agent.

Select Voice when you need to choose the agent type. For Name, specify a unique name for your voice agent. As you might end up with a list of services, choose a descriptive name, and consider adding the region in which your services are hosted; this will come in handy later. The name can be up to 64 characters.

For Phone number, add the number from your SIP trunk, including the country and area codes. The phone number can have a maximum of 30 characters, including spaces and the + ( ) - characters. I successfully set up a South African number using Twilio. Choosing a local number based on your location saves money when it comes to demo time, and also while testing from a phone.

Four Step Setup Within IBM Cloud

You can add multiple numbers to one Voice Agent by clicking Manage, next to Phone Number.

To enable call transfer, enter the termination URI for your Default transfer target.

Under Conversation, configure the connection to your Watson Assistant service instance by clicking Location 1 or Location 2 and enabling the location that you selected.

You can use Watson Assistant service instances in IBM Cloud accounts that you or someone else owns. You can also connect to any of these options through a service orchestration engine.

The Watson Assistant portion of your solution holds the logic, dialog, intents and entities.

Most of your programming, logic and general Voice Agent behavior is determined here.

Under Text to Speech, review the default configuration for your Text to Speech service instance by clicking Location 1 or Location 2 and enabling that location. The configuration can be customized from there.

Ending A Call

The IBM Voice Gateway has a few commands which allow you to program your Voice Gateway very efficiently. One of them is vgwActHangup. This command can be used to issue a hangup and end the call from the program’s side.

{
  "output": {
    "vgwAction": {
      "command": "vgwActHangup"
    },
    "generic": [
      {
        "response_type": "text",
        "values": [],
        "selection_policy": "sequential"
      }
    ]
  }
}

The JSON code can be added to the conversational node within Watson Assistant as shown here.

Adding JSON To terminate The Call

The typical approach would be to have an intent which catches anything hangup or end-the-call related. It is best practice to have a confirmation node: do you want to end the call, yes or no?

On confirmation from the user, the call termination can be invoked and the call ended.
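
As an illustrative variation on the vgwActHangup snippet above, the hangup can be combined with a short goodbye message so the caller is not cut off in silence; the closing sentence is my own wording:

{
  "output": {
    "vgwAction": {
      "command": "vgwActHangup"
    },
    "generic": [
      {
        "response_type": "text",
        "values": [
          "Thank you for calling. Goodbye."
        ],
        "selection_policy": "sequential"
      }
    ]
  }
}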

Change TTS Voice In Call

The voice of the bot can be changed on the fly. You can write a routine where the user says “I want to speak to Mike”, and Mike can answer and take the call from there.

{
  "output": {
    "text": {
      "values": [
        "Hi this is Mike! How can I help?"
      ],
      "selection_policy": "sequential"
    }
  },
  "context": {
    "vgwTTSConfigSettings": {
      "config": {
        "voice": "en-US_MichaelVoice"
      }
    }
  }
}

This is the JSON portion you will need to embed in one of the Watson Assistant dialog nodes.

JSON Portion within the Dialog Node To Change the TTS Voice

The same can be done for any of the other voices…

If the user says, I want to speak to Kate, a dialog node with the following JSON is called:

{
  "output": {
    "text": {
      "values": [
        "This is Kate, and I am the Great Britain voice. How can I help?"
      ],
      "selection_policy": "sequential"
    }
  },
  "context": {
    "vgwTTSConfigSettings": {
      "config": {
        "voice": "en-GB_KateVoice"
      }
    }
  }
}

The TTS service is updated on the fly, within the same call, and it is as if the call is handed over to another person.

Change Language In Call

The language of the call can also be changed on the fly, within a live call. A user might say, can we speak Italian, or can we speak French. In this case the voicebot can change to a different language altogether.

Should a user say, I want to speak German, or should language detection sense that the user is speaking German, the language and TTS voice can be updated.

{
  "output": {
    "text": {
      "values": [
        "Willkommen bei dieser IBM Watson-Demonstration. Was möchtest du mich fragen?"
      ],
      "selection_policy": "sequential"
    }
  },
  "context": {
    "vgwTTSConfigSettings": {
      "config": {
        "voice": "de-DE_BirgitVoice"
      }
    }
  }
}

In the JSON portion you can see that the language and locale are changed to a specific German TTS voice.

Watson Assistant View Of Dialog Node

In this fashion any available language, locale or voice can be invoked and this all happens in-call.
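
In my experience the Speech to Text side usually needs to follow the language switch as well. Here is a minimal sketch, assuming the vgwSTTConfigSettings context variable and the de-DE_BroadbandModel are available in your instance, that updates STT and TTS together:

{
  "output": {
    "text": {
      "values": [
        "Willkommen bei dieser IBM Watson-Demonstration. Was möchtest du mich fragen?"
      ],
      "selection_policy": "sequential"
    }
  },
  "context": {
    "vgwSTTConfigSettings": {
      "config": {
        "model": "de-DE_BroadbandModel"
      }
    },
    "vgwTTSConfigSettings": {
      "config": {
        "voice": "de-DE_BirgitVoice"
      }
    }
  }
}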

Handling Response Timeouts

The detection and handling of a response timeout is fairly standard; catching this event allows for handling the call intelligently.

Setting vgwPostResponseTimeout as Intent

Silence during a voice call must be avoided at all costs.

What I like about the Voice Agent gateway is that vgwPostResponseTimeout can be set directly as an intent within Watson Assistant.

This illustrates the level of integration between the two elements.

The Voice Gateway can be managed from within Watson Assistant on an intent basis.

Dialog Node Where an Assistant Response Is Defined

The response of the assistant can be defined at that point in the conversation.
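
As a rough sketch of what the dialog node that catches the vgwPostResponseTimeout utterance could return (the prompt wording is my own):

{
  "output": {
    "text": {
      "values": [
        "Are you still there? You can ask me a question whenever you are ready."
      ],
      "selection_policy": "sequential"
    }
  }
}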

Or any other action can be taken, like transferring the call to a live service representative.
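
A transfer, for instance, can be issued with the vgwActTransfer command. This is a minimal sketch, assuming a placeholder target number and that a transfer target is configured on your SIP trunk as described earlier:

{
  "output": {
    "vgwAction": {
      "command": "vgwActTransfer",
      "parameters": {
        "transferTarget": "tel:+15555550100"
      }
    },
    "generic": [
      {
        "response_type": "text",
        "values": [
          "Please hold while I transfer you to a service representative."
        ],
        "selection_policy": "sequential"
      }
    ]
  }
}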

Conclusion

Conversational interfaces are becoming pervasive and are expanding into different mediums. In this case, Conversational AI is extending into a traditional medium like a voice call.

Callers are not confined to the DTMF menu or keypad anymore and are allowed to speak freely. Obviously there will be challenges which will impede the perceived quality of the service.

Background noise, voice quality during the call and initial user screening will always dictate the user experience.

IBM Voice Agent with Watson & twilio
