General Chatbot Architecture, Design & Development Overview
…And What Components Are Required To Constitute A Conversational Interface
Learn what technologies must work well together for users to have a conversation…
Chatbots are in themselves hard to establish as comprehensive conversational interfaces, and adding voice adds significantly to the challenge.
In this story I will go over a few architectural, design and development considerations to keep in mind.
A chatbot is a human-computer dialog system via natural language; in other words, a human having a natural conversation with a computer or system.
The chatbot must be able to hold a dialog and understand the user; you could describe this as the function of comprehension.
This comprehension includes intent and entity recognition. Intents can be seen as verbs and entities as nouns.
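As a minimal sketch of this idea (the intent names, keyword rules and city list below are invented for illustration; real NLU engines use trained classifiers, not keyword matching):

```python
# Toy illustration of comprehension: intents as verbs, entities as nouns.
# The keyword rules here are illustrative assumptions, not a real NLU engine.

INTENT_KEYWORDS = {
    "book_flight": ["book", "fly", "flight"],
    "check_balance": ["balance", "account"],
}
KNOWN_CITIES = ["paris", "london", "tokyo"]

def understand(utterance: str):
    """Return (intent, entities) recognised in the utterance."""
    tokens = utterance.lower().replace("?", "").split()
    intent = next(
        (name for name, kws in INTENT_KEYWORDS.items()
         if any(k in tokens for k in kws)),
        "unknown",
    )
    entities = {"city": [t.capitalize() for t in tokens if t in KNOWN_CITIES]}
    return intent, entities

print(understand("Book a flight to Paris"))
# → ('book_flight', {'city': ['Paris']})
```

A production NLU layer does the same mapping, but with statistical models and confidence scores rather than keyword lists.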
Text-based bots have, at the very least, a Natural Language Understanding (NLU) component, whereas a voice bot also demands an initial speech-recognition layer (speech-to-text) and a final speech-generation layer (text-to-speech).
Not all intelligence is vested in the NLU capabilities. Bots must also have access to an external base of knowledge and common sense via APIs, so that they can provide the function of competence: answering user questions.
Lastly, the embodied agent should provide a very functional presence. Ironically, these digital agents did not exist until recently and were once regarded as very optional; now this function proves crucial for ordinary users.
Automatic Speech Recognition (ASR)
Speech Recognition or Speech-To-Text (STT) is the process of converting speech audio into text.
The goal of ASR is to achieve speaker-independent large vocabulary speech recognition.
While chatbots have the luxury of addressing a very narrow domain, the STT/ASR layer must be able to field a large vocabulary, ensuring that whatever is said can be converted to text.
The chatbot might not be able to directly address the query or request, which may fall outside its domain, but the ASR must at the very least present accurate text to the chatbot/NLU portion.
Vocabularies started out very small, including only basic phrases (e.g. yes, no, digits), and now include millions of words in many languages.
Speaker independence is a given, though the ability exists to recognize specific speakers. There is also a difference between speaker identification and speaker verification; this is only relevant if the chatbot uses the speaker’s identity to generate user-specific responses.
A continuous stream of words is problematic, since it does not necessarily contain breaks between words. Hence it is helpful to give the user a signal to start talking, and to keep the utterance as short as possible.
The ability to filter out background noise is vital. Noise interfering with the speaker’s utterance can be traffic, background music, other people speaking, and so on.
Microphones can be near-field (like AirPods) or far-field (like the Amazon Echo devices); in the case of a telephone call, the handset or mobile phone is the microphone. This distinction refers to the ability to process speech at varying distances from the microphone.
Factors affecting speech recognition include environmental noise, the speaker’s emotional state, fatigue, and distance from the microphone.
Hence speech/non-speech segmentation is vital. The ASR system must distinguish between the phonemes (the basic units of speech) that should be captured for transcription and the background noise.
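To make the idea of speech/non-speech segmentation concrete, here is a minimal energy-based sketch. Production ASR systems use trained voice-activity detectors; the fixed amplitude threshold and frame size below are illustrative assumptions.

```python
# Minimal sketch of speech/non-speech segmentation using frame energy.
# The threshold and frame size are illustrative assumptions only.

def segment_speech(samples, frame_size=160, threshold=0.1):
    """Label each frame as speech (True) or non-speech (False)
    by comparing its mean absolute amplitude to a threshold."""
    labels = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        labels.append(energy > threshold)
    return labels

silence = [0.01] * 160          # low-amplitude background
speech = [0.5, -0.5] * 80       # higher-amplitude "speech"
print(segment_speech(silence + speech))
# → [False, True]
```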
Natural Language Understanding
Natural Language Understanding underpins the capabilities of the chatbot.
Without entity detection and intent recognition all efforts to understand the user come to naught.
Most chatbot architectures consist of four pillars: intents, entities, the dialog flow (state machine), and scripts.
The dialog flow contains the blocks or states a user navigates between. Each dialog node is associated with one or more intents and/or entities; session variables can also be employed to decide which states or nodes must be visited.
The intents and entities constitute the condition on which a dialog node is accessed.
The dialog contains the output to the customer in the form of a script, or a message… or wording, if you like.
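The four pillars can be sketched together in a few lines. The node names, intents and scripts below are hypothetical; real frameworks express the same structure through their own configuration formats.

```python
# Sketch of the four pillars: intents act as the condition for entering
# a dialog node, each node carries its script, and session variables
# fill the script's placeholders. All names here are made up.

DIALOG_NODES = [
    {"condition": "greet",         "script": "Hello! How can I help?"},
    {"condition": "check_balance", "script": "Your balance is {balance}."},
    {"condition": None,            "script": "Sorry, I did not get that."},  # fallback
]

def step(intent, session):
    """Return the script of the first node whose condition matches the intent."""
    for node in DIALOG_NODES:
        if node["condition"] is None or node["condition"] == intent:
            return node["script"].format(**session)

print(step("check_balance", {"balance": "$42.00"}))
# → Your balance is $42.00.
```

The `None` condition plays the role of the fallback node that every state machine needs when no intent matches.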
This is one of the most boring and laborious tasks in crafting a chatbot. It can become complex, and changes made in one area can inadvertently impact another; a lack of consistency can also lead to unplanned user experiences.
Scaling this environment is tricky, especially if you want to scale across a large organisation.
Responses to the user start with the text dialog deemed the appropriate response. This text normally comes from a list or set of possible responses; the particular response is chosen based on the state or dialog point the conversation is at.
The response may also need to be composed dynamically, for instance when a value or a phone number needs to be embedded in the response in a natural way.
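A minimal sketch of this response selection, assuming a hypothetical dialog state with several candidate scripts and a session value to embed:

```python
import random

# Sketch of response selection: each dialog state holds a set of
# candidate scripts, one is picked at random for variety, and session
# values are embedded naturally. State names are hypothetical.

RESPONSES = {
    "callback_offer": [
        "We can call you back on {phone}.",
        "Shall we reach you at {phone}?",
    ],
}

def respond(state, session):
    template = random.choice(RESPONSES[state])
    return template.format(**session)

print(respond("callback_offer", {"phone": "555 0100"}))
```

Varying the wording between candidate scripts is a cheap way to make repeated visits to the same state feel less mechanical.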
In the case of a voicebot, this text must be spoken to the user.
The speech is generated by a speech-synthesis engine. This is also a comprehensive solution, which must be able to synthesize any text into audio, and these environments are language-dependent.
An interesting component of Text-To-Speech is Speech Synthesis Markup Language (SSML).
SSML is a markup language allowing you to tweak how speech should be generated.
In the example below, the string “12345” is spoken back to the user as a cardinal, “twelve thousand three hundred and forty-five”; as an ordinal, “twelve thousand three hundred and forty-fifth”; and lastly as digits, “one two three four five”.
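An SSML fragment covering all three renderings uses the standard `say-as` element; the exact set of supported `interpret-as` values varies slightly between TTS engines:

```xml
<speak>
  <say-as interpret-as="cardinal">12345</say-as>
  <say-as interpret-as="ordinal">12345</say-as>
  <say-as interpret-as="digits">12345</say-as>
</speak>
```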
Then there is also experimentation in terms of Natural Language Generation (NLG).
Commercial NLG is emerging, and forward-looking solution providers are looking at incorporating it into their solutions. At this stage you might be struggling to get your mind around the practicalities; below is a practical example which might help.
In the video here, I took a data set from kaggle.com with about 185,000 records.
Each of these records was a newspaper headline, which I used to train a TensorFlow model.
Based on this model, I could then enter one or two intents, and random “fake” (hence non-existent) headlines were generated. There is a host of parameters which can be used to tweak the output.
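One of the most common of those tweakable parameters is the sampling temperature, which controls how adventurous the generated text is. The scores and vocabulary below are made up; this is a generic sketch of temperature sampling, not the specific model from the video.

```python
import math
import random

# Sketch of temperature sampling over a model's next-word scores.
# Low temperature → nearly always the top-scoring word;
# high temperature → more varied, more "creative" output.

def sample_next(scores, temperature=1.0, rng=random):
    """Sample an index from unnormalised scores, softened by temperature."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights)[0]

# With a very low temperature the highest score (index 1) dominates.
print(sample_next([1.0, 5.0, 2.0], temperature=0.05))
# → 1
```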
Conversational Best Practice
Digression is a common and natural part of most conversations…
The speaker introduces a topic, can subsequently introduce a story that seems to be unrelated, and then return to the original topic.
Digression can also be explained in the following way: a user is in the middle of a dialog (also referred to as a customer journey, topic, or user story) designed to achieve a single goal, but decides to abruptly switch topics and initiate a dialog flow designed to address a different goal.
Hence the user wants to jump midstream from one journey or story to another. This is usually not possible within a chatbot: once a user has committed to a journey or topic, they have to see it through. Normally the dialog does not support this ability for a user to change subjects.
Often an attempt by the user to digress ends in an “I am sorry” from the chatbot and breaks the current journey.
Hence the chatbot framework you are using should allow for this: popping out of and back into a conversation.
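One common way frameworks support this is to keep the active journeys on a stack, so a digression pushes a new journey and finishing it resumes the interrupted one. This is a generic sketch; journey names are hypothetical.

```python
# Sketch of digression support via a dialog stack: the user can pop
# out to a new topic mid-journey and return to the original one.

class DialogStack:
    def __init__(self):
        self.stack = []

    def start(self, journey):
        """Begin a journey, or digress into a new one mid-conversation."""
        self.stack.append(journey)

    def finish(self):
        """Complete the current journey and resume the one beneath it."""
        self.stack.pop()

    def current(self):
        return self.stack[-1] if self.stack else None

dialog = DialogStack()
dialog.start("book_flight")
dialog.start("check_balance")   # user digresses mid-journey
dialog.finish()                 # balance answered...
print(dialog.current())
# → book_flight (the original journey resumes)
```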
…and another is disambiguation. Throughout a conversation, we as humans will invariably and intuitively detect ambiguity.
Ambiguity is when we hear something that is open to more than one interpretation. Instead of going off on a tangent not intended by the utterance, we perform the act of disambiguation by asking a follow-up question. Simply put, this removes ambiguity from a statement or dialog.
Ambiguity makes sentences confusing. For example, “I saw my friend John with binoculars”.
Does this mean John was carrying a pair of binoculars?
Or that I could only see John by using a pair of binoculars?
Hence, I need to perform disambiguation, and ask for clarification.
A chatbot encounters the same issue: where the user’s utterance is ambiguous, instead of going off on one assumed intent, it can ask the user to clarify their input.
The chatbot can present a few options based on the current context, from which the user can select and confirm the most appropriate one.
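A common implementation is to compare the confidence scores of the top intents and only commit when there is a clear winner. The 0.15 margin and the intent names below are illustrative assumptions.

```python
# Sketch of disambiguation: when the top intent confidences are close,
# ask the user to choose rather than committing to one interpretation.
# The margin value and intent names are illustrative assumptions.

def decide(intent_scores, margin=0.15):
    ranked = sorted(intent_scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, p1), (second, p2) = ranked[0], ranked[1]
    if p1 - p2 < margin:
        return f"Did you mean '{top}' or '{second}'?"
    return f"Proceeding with '{top}'."

print(decide({"report_lost_card": 0.46, "block_card": 0.44, "faq": 0.10}))
# → Did you mean 'report_lost_card' or 'block_card'?
```

Because the two leading options are genuinely related to the utterance, the clarifying question advances the conversation instead of derailing it.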
Just to illustrate how effective we as humans are at disambiguating and detecting subtle nuances, have a look at the following two sentences:
- A drop of water on my mobile phone.
- I drop my mobile phone in the water.
These two sentences have vastly different meanings; compared to each other there is no real ambiguity, but for a conversational interface they will be hard to detect and separate.
In conclusion, suffice it to say that the holy grail of chatbots is to mimic and align with natural, human-to-human conversation as much as possible. And to add to this: when designing the conversational flow for a chatbot, we often forget which elements are part and parcel of truly human-like conversation.
Digression is a big part of human conversation, along with disambiguation of course. Disambiguation negates to some extent the danger of fallback proliferation, where the dialog is not really taken forward.
With disambiguation, a bouquet of truly related and contextual options is presented to the user to choose from, which is sure to advance the conversation.
And finally, probably the worst thing you can do is present a set of options which is not related to the current context, or a predefined, finite set of options which recurs continually.
Contextual awareness is key in all elements of a chatbot.