Photo by Matteo Catanese on Unsplash

Updated: Chatbots Should Be An Abstraction Of Human Conversation

10 Principles To Craft Compelling Natural Conversations

Cobus Greyling
20 min read · May 5, 2022


Introduction

The problem all Conversational AI frameworks face is that of crafting an ecosystem for creating natural conversational experiences.

  1. Elements of natural human-to-human conversations need to be identified and then abstracted.
  2. In turn, these elements need to be orchestrated in a framework, which could be no-code, low-code or pro-code, or a combination.
  3. This framework must allow users to design, integrate and craft conversations.
  4. These crafted conversations need to be packaged and surfaced to users. This is also where the iterative process starts again, going from point 4 back to point 1. But this time the conversational elements are not necessarily abstracted from human-to-human conversations, but from human-to-bot conversations. There are a number of tools available to navigate from point 4 back to point 1; one of these is HumanFirst, read more about this process here.
The process of creating a conversational AI ecosystem. By abstracting conversational elements, orchestrating these elements in a UI, allowing users to craft conversations using these building blocks. And lastly, having a process of packaging the conversation and surfacing it to the user via a medium of sorts. This medium can be web chat, messenger, speech etc.

As seen above, this sequence of events holds true when a Conversational AI framework is created, but also when conversational experiences are crafted.

The true measure of a Conversational AI framework is how well real-world conversation-elements are abstracted and orchestrated in a UI.

Generalisation

So, as we have said, to build everyday, natural conversations with any given tool, all the concepts, objects and attributes of a real-world conversation need to be abstracted.

Hence abstraction is the process of creating abstract concept-objects by mirroring common features or attributes of non-abstract, real-life conversations.

These abstracted elements in turn need to be orchestrated into a development environment to create a Conversational AI framework.

In this process of abstraction, detail is lost through generalisation.

  • Generalisation is necessary in order to constitute building blocks…
  • Generalisation is also important to be accurate in crafting conversations, and to have control over those conversations. We want control, with flexibility.
The building blocks of the Nuance Mix ecosystem. You will see there is a clear separation of NLU and Dialog Features. Yet later in the dialog design process there is a link back to NLU. This is the challenge of any Conversational AI ecosystem: to have elements coupled together, but also independent to some degree.

As seen above, this is the challenge of any Conversational AI ecosystem: elements must be coupled together, yet remain independent to some degree.

This principle is evident in the architecture of Cognigy. Cognigy is indeed a complete and very cohesive Conversational AI framework. The product is truly seamless for developing conversational experiences, yet the different components can be used independently. So the Cognigy solution can be used as an NLU API, or dialog management can be used in isolation, or the whole solution can even be white-labelled.

Control & Flexibility

Control and flexibility; these two need to be in balance:

  • Control without Flexibility negates key conversational elements like digression, disambiguation, clarification, domain/relevance, compound intents & entities.
  • Flexibility without Control definitely has a certain allure, but it gives rise to a lack of affordances for fine-tuning conversations. This is very much the case with OpenAI’s Language API: immense flexibility which is enticing, but limited fine-tuning.

There are Conversational AI frameworks with a high degree of control, IBM Watson Assistant counts among those. This is not a bad thing and allows for rapid adoption of the technology and proliferation of bots. Add to this list IBM Watson Assistant Action Skills and Microsoft Power Virtual Agents.

Then you have OpenAI with their Language API, where the implementation of Natural Language Generation (NLG) is really so good, it’s jarring. The way context is maintained through multiple dialog turns, the general bot fallback resilience…all aspects are staggering. But with this immense flexibility, control is hard…and by control we mean fine-tuning.

And lastly, the abstracted elements must then be used to constitute conversations which simulate real-world conversations as close as possible. Hence closing the loop.

When creating or rather crafting a chatbot conversation we as designers must draw inspiration and guidance from real-world conversations.

Still life by Dutch painter, Henk Helmantel. I found viewing his paintings in real life jarring. You know it is an abstraction of reality, yet appearing to be real.

Elements of human conversation should be identified and abstracted to be incorporated in our chatbot conversation.

General rules and concepts of human conversations must be derived and implemented via technically astute means.

Below I list 10 elements of human conversation which should be incorporated in a Conversational AI interface. Conversational designers want users to speak to their chatbot as they would to a human…hence it is time for the chatbot to converse in a more human-like way.

Christoph Niemann has fascinating ideas on abstraction and when visual design becomes too abstract.

1️⃣ Digression

Digression is a common and natural part of most conversations…

Here is the scenario: the speaker introduces a topic, subsequently introduces a second topic, another story that seems to be unrelated, and then returns to the original topic.

Or, digression can also be explained in the following way…a user is in the middle of a dialog, also referred to as a customer journey, topic or user story.

Above, a conversation where the user can digress from the Balances flow to the Branches flow. The user cannot digress back, as the Branches flow’s NLU does not have the Balances intents attached.

The dialog is designed to achieve a single goal, but the user decides to abruptly switch the topic and initiate a dialog flow that is designed to address a different goal.

Hence the user wants to jump midstream from one journey or story to another.

This is usually not possible within a chatbot; once a user has committed to a journey or topic, they have to see it through. Normally the dialog does not support the ability for a user to change subjects.

Often an attempt to digress by the user ends in an “I am sorry” from the chatbot and breaks the current journey. Or there is fallback proliferation.

Hence the chatbot framework you are using should allow for digression, where users pop out of and back into a conversation.

The easy approach is to structure the conversation very rigidly from the chatbot’s perspective, and funnel the user in and out of the conversational interface. This might even present very favorably in reporting, but the user experience is appalling.

Overly structuring the conversation breaks the beauty of a conversational interface. Unstructured conversational interfaces are hard to craft, but make for an exceptional user experience.

One of the reasons is that users are so used to having to structure their input that they want to enjoy and exercise the freedom of speech (spoken or text), which can lead to disappointment if the expectation of freedom is not met.

Digression should also not be too lenient. Should this be the case, then users might inadvertently digress and then there is conversational breakdown. It makes sense to prompt users prior to digression in some instances.
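To make the mechanics concrete, below is a minimal, framework-agnostic sketch (in Python) of how digression can be supported with a flow stack: when the NLU maps the input to a different flow, the current flow is suspended rather than broken, and resumed once the digression completes. The flow names and the handler methods are hypothetical, purely for illustration.

# Minimal sketch: supporting digression with a flow stack.
# Flow names and the surrounding framework are hypothetical placeholders.

class DialogManager:
    def __init__(self):
        self.flow_stack = []        # suspended flows, most recent last
        self.active_flow = None     # flow currently collecting input

    def handle(self, user_input, detected_flow):
        """detected_flow is the flow the NLU mapped the user input to."""
        if self.active_flow and detected_flow != self.active_flow:
            # The user digressed: suspend the current flow instead of failing.
            self.flow_stack.append(self.active_flow)
            print(f"Digressing from '{self.active_flow}' to '{detected_flow}'.")
        self.active_flow = detected_flow
        # ... run the active flow's steps here ...

    def complete_active_flow(self):
        """Called when the active flow reaches its goal."""
        print(f"'{self.active_flow}' completed.")
        if self.flow_stack:
            # Pop back into the suspended conversation.
            self.active_flow = self.flow_stack.pop()
            print(f"Resuming '{self.active_flow}'. Where were we?")
        else:
            self.active_flow = None

dm = DialogManager()
dm.handle("What is my balance?", "Balances")
dm.handle("Where is the nearest branch?", "Branches")   # digression
dm.complete_active_flow()                                # resumes Balances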

2️⃣ Disambiguation

As mentioned before, you will find that many of the basic elements of human conversation are not introduced to most chatbots.

A good example of this as we have seen is digression 👆🏻…and another is disambiguation. Often throughout a conversation we as humans will invariably and intuitively detect ambiguity.

IBM Watson Assistant Example of Disambiguation Between Dialog Nodes

Ambiguity is when we hear something that is open to more than one interpretation. Instead of just going off on a tangent which is not intended by the utterance, I perform the act of disambiguation by asking a follow-up question.

This is simply put, removing ambiguity from a statement or dialog.

Ambiguity makes sentences confusing. For example, “I saw my friend John with binoculars”. Does this mean John was carrying a pair of binoculars? Or that I could only see John by using a pair of binoculars?

Hence, I need to perform disambiguation, and ask for clarification. A chatbot encounters the same issue, where the user’s utterance is ambiguous and instead of the chatbot going off on one assumed intent, it could ask the user to clarify their input. The chatbot can present a few options based on a certain context; this can be used by the user to select and confirm the most appropriate option.

Just to illustrate how effective we as humans are at disambiguating and detecting subtle nuances, have a look at the following two sentences:

A drop of water on my mobile phone.

I drop my mobile phone in the water.

These two sentences have vastly different meanings, and compared to each other there is no real ambiguity, but for a conversational interface this will be hard to detect and separate.

IBM Watson Assistant Initial Configuration for Disambiguation

Disambiguation allows the chatbot to request contextual clarification from the user. A list of related options should be presented to the user, allowing the user to disambiguate the dialog by selecting an option from the list.

But, the list presented should be relevant to the context of the utterance; hence only contextual options must be presented.

Disambiguation enables chatbots to request help from the user when more than one dialog node might apply to the user’s query.

Instead of assigning the best-guess intent to the user’s input, the chatbot can create a collection of top nodes and present them. In this case the decision, when there is ambiguity, is deferred to the user.

What is really a win-win situation is when the feedback from the user can be used to improve your NLU model; as this is invaluable training data vetted by the user.

Disambiguation can be triggered when the confidence scores of the runner-up intents, that are detected in the user input, are close in value to the top intent.

Hence there is no clear separation and certainty.

There should of course be a “none of the above” option; if a user selects this, a real-time live agent handover can be performed, or a call-back can be scheduled. Or, a broader set of options can be presented.
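As a rough illustration of this triggering logic, the Python sketch below only disambiguates when the runner-up intents score close to the top intent, and always appends a “None of the above” escape option. The intent names, threshold and option count are assumptions for illustration, not taken from any specific framework.

# Sketch: trigger disambiguation when runner-up intents score close to the top intent.
# Intent names, gap_threshold and max_options are illustrative assumptions.

def maybe_disambiguate(intents, gap_threshold=0.15, max_options=3):
    """intents: list of (intent_name, confidence) pairs from the NLU."""
    ranked = sorted(intents, key=lambda x: x[1], reverse=True)
    top_name, top_score = ranked[0]

    # Runner-ups within the gap are considered ambiguous with the top intent.
    close = [name for name, score in ranked[1:max_options]
             if top_score - score <= gap_threshold]

    if not close:
        return {"action": "proceed", "intent": top_name}

    options = [top_name] + close + ["None of the above"]
    return {"action": "disambiguate", "options": options}

nlu_result = [("check_balance", 0.62), ("transaction_history", 0.55), ("card_limits", 0.21)]
print(maybe_disambiguate(nlu_result))
# -> {'action': 'disambiguate',
#     'options': ['check_balance', 'transaction_history', 'None of the above']}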

3️⃣ Auto Learning

Just as a human concierge or receptionist learns over time and improves in their job, a chatbot should also learn and improve over time. This learning should take place automatically.

Here is a practical example of achieving this…

For example, the ideal chatbot conversation is just that, conversation-like. Natural language is highly unstructured. When the conversation is not gaining traction, it does make sense to introduce a form of structure.

This form of structure is ideally:

  • A short menu of 3 to 4 items presented to the user.
  • With Menu Items contextually linked to the context of the last dialog.
  • Acting to disambiguate the general context.
  • And an option for the user to establish an undetected context.

Once the context is confirmed by the user, the structure can be removed from the conversation, and the conversation can then ensue unstructured with natural language.

IBM is really the leader with this principle of a chatbot learning to prioritise automatically.

The brief introduction of structure is merely there as a mechanism to further the dialog. This serves as a remedy against fallback proliferation.

The idea behind autolearning is to order these disambiguation menus according to use or user popularity.

A practical example:

When a customer asks a question that the assistant isn’t sure it understands, the assistant often shows a list of topics to the customer and asks the customer to choose the right one.

This process is called disambiguation.

If, when a similar list of options is shown, customers most often click the same one (option #2, for example), then your skill can learn from that experience.

It can learn that option #2 is the best answer to that type of question. And next time, it can list option #2 as the first choice, so customers can get to it more quickly.

And, if the pattern persists over time, it can change its behavior even more. Instead of making the customer choose from a list of options at all, it can return option #2 as the answer immediately.

The premise of this feature is to improve the disambiguation process over time to such an extent that eventually the correct option is presented to the user automatically. Hence the chatbot learns how to disambiguate on behalf of the user.
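A minimal sketch of what such auto-learning behaviour could look like is shown below: click counts are accumulated per disambiguation context, the menu is reordered by popularity, and once a single option dominates it is returned directly instead of a menu. The counters, thresholds and option names are assumptions for illustration; this is not IBM’s actual implementation.

# Sketch of auto-learning for disambiguation menus (not IBM's actual implementation).
# Option names, thresholds and the dominance rule are illustrative assumptions.
from collections import Counter

class AutoLearner:
    def __init__(self, dominance_ratio=0.8, min_samples=20):
        self.clicks = {}                    # context -> Counter of chosen options
        self.dominance_ratio = dominance_ratio
        self.min_samples = min_samples

    def record_choice(self, context, option):
        self.clicks.setdefault(context, Counter())[option] += 1

    def resolve(self, context, options):
        counts = self.clicks.get(context, Counter())
        total = sum(counts.values())

        if total >= self.min_samples:
            option, top = counts.most_common(1)[0]
            if top / total >= self.dominance_ratio:
                # One option dominates: answer directly and skip the menu.
                return {"action": "answer", "option": option}

        # Otherwise show the menu, most popular options first.
        ordered = sorted(options, key=lambda o: counts.get(o, 0), reverse=True)
        return {"action": "menu", "options": ordered}

learner = AutoLearner()
for _ in range(25):
    learner.record_choice("card queries", "Report lost card")
print(learner.resolve("card queries", ["Card limits", "Report lost card", "New card"]))
# -> {'action': 'answer', 'option': 'Report lost card'}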

4️⃣ Domain & Irrelevance

A service agent is not trained to answer questions which are irrelevant and outside the domain of the organization.

How do you develop for user input which is not relevant to your design…

In general chatbots are designed and developed for a specific domain. These domains are narrow and applicable to the concern of the organization they serve. Hence chatbots are custom and purpose-built as an extension of the organization’s operation, usually to allow customers to self-service.

As an added element to make the chatbot more interactive and lifelike, and to anthropomorphize the interface, small talk is introduced. Also referred to as chitchat.

But what happens if a user utterance falls outside this narrow domain? With most implementations the highest-scoring intent is assigned to the user’s utterance, in a frantic attempt to field the query.

Negate False Intent Assignment

So, instead of stating that the input is out of scope, in a desperate attempt to handle the user utterance the chatbot assigns the best-fit intent to it; often the wrong one.

Alternatively the chatbot continues to inform the user it does not understand, having the user continuously rephrase the input. Rather have the chatbot merely state that the question is not part of its domain.

A handy design element is to have two or three sentences serve as an intro for first-time users; sketching the gist of the chatbot domain.

The traditional approaches are:

  • Many “out-of-scope” examples are dreamed up and entered, which is hardly ever successful.
  • Attempts are made to disambiguate the user input.

But actually, the chatbot should merely state that the query is outside of its domain and give the user guidance.

OOD & ID

So, user input can broadly be divided into two groups: In-Domain (ID) and Out-Of-Domain (OOD) inputs. ID inputs are where you can attach the user’s input to an intent based on existing training data. OOD detection refers to the process of tagging data which does not match any label (intent) in the training set.

An example of the IBM Watson Assistant testing and training interface. User utterances can be assigned to an existing intent, or marked as irrelevant.

Traditionally OOD training requires large amounts of training data, hence OOD detection does not perform well in current chatbot environments.

An advantage of most chatbot development environments is that only a very limited amount of training data is required; perhaps 15 to 20 example utterances per intent.

We don’t want developers spending vast amounts of time on an element not part of the bot’s core.

The challenge is that as a developer, you need to provide training data and examples. OOD or irrelevant input covers a potentially infinite number of scenarios, as there is no boundary defining irrelevance.

The ideal is to build a model that can detect OOD inputs with a very limited set of data defining the intent; or no OOD training data at all.

The second option being the ideal…
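One common, if simplistic, approximation when you have no OOD training data at all is to threshold the top intent confidence: anything below the threshold is marked irrelevant rather than force-fitted to the best-guess intent. The Python sketch below illustrates the idea; the intent names, threshold value and reply text are assumptions for illustration.

# Sketch: crude out-of-domain (OOD) detection via a confidence threshold.
# Intent names, the threshold value and the reply text are illustrative assumptions.

OOD_THRESHOLD = 0.4   # tune per model; below this the input is treated as irrelevant

def classify(nlu_scores, threshold=OOD_THRESHOLD):
    """nlu_scores: dict of intent -> confidence, as returned by your NLU model."""
    intent, confidence = max(nlu_scores.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        # Out of domain: say so and guide the user, instead of guessing an intent.
        return {"intent": None,
                "reply": "That falls outside what I can help with. "
                         "I can assist with balances, cards and branches."}
    return {"intent": intent, "confidence": confidence}

print(classify({"check_balance": 0.12, "card_limits": 0.09}))   # treated as OOD
print(classify({"check_balance": 0.83, "card_limits": 0.05}))   # in-domain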

5️⃣ Complex Intents

Structure is being built into intents in the form of hierarchical, or nested, intents (HumanFirst & Cognigy). Intents can be switched on and off, and weights can be added. Thresholds are set per intent for relevance, and a threshold can in some cases be set for a disambiguation prompt. There are sub-patterns within intents (see Kore.ai); Kore.ai also has sub-intents and follow-up intents.

Hierarchical Intents (Nested Intents)

Much of the current activity in the chatbot marketplace around intents and entities centres on this added structure. Both HumanFirst and Cognigy offer hierarchical or nested intents; Cognigy offers nesting of intents up to three levels deep.

The intent Balances (red) has three sub-intents (green), and third-level intents (yellow). This example is from Cognigy.


On the utterance: “savings balance for my own account”, the NLU returns:

"intentLevel": {
"level1": "Balances",
"level2": "Personal Accounts",
"level3": "Savings"
}

You can see the advantage of nested intents, knowing that a balance was requested, for a personal account, and the account type is Savings.
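A nested result like this lends itself to simple, layered routing. Below is a hedged Python sketch of how such a result could be consumed; the handler names are hypothetical, while the result structure mirrors the Cognigy JSON shown above.

# Sketch: routing on a nested intent result such as the Cognigy example above.
# Handler names are hypothetical; the result structure mirrors the JSON shown.

nlu_result = {
    "intentLevel": {
        "level1": "Balances",
        "level2": "Personal Accounts",
        "level3": "Savings",
    }
}

def route(result):
    levels = result["intentLevel"]
    key = (levels.get("level1"), levels.get("level2"), levels.get("level3"))
    handlers = {
        ("Balances", "Personal Accounts", "Savings"): "fetch_personal_savings_balance",
        ("Balances", "Personal Accounts", None): "fetch_personal_balances",
        ("Balances", None, None): "ask_which_account",
    }
    # Fall back from the most specific key to progressively more general ones.
    for candidate in (key, (key[0], key[1], None), (key[0], None, None)):
        if candidate in handlers:
            return handlers[candidate]
    return "fallback"

print(route(nlu_result))   # -> fetch_personal_savings_balance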

HumanFirst allows much deeper nesting, affording users the capability to nest intents (sub-intents). In the example below, a complex nesting of sub-intents is defined under Sports, seven levels deep: Sports/BaseBall/Teams/Top5Teams/New York Yankees/Tickets/Season.

6️⃣ Anthropomorphize

People respond well to personas, also to a graphic representation of a persona. We anthropomorphise things by nature; cars, ships, other inanimate objects…and chatbots are certainly no different. User perception of the chatbot definitely affects how they engage and interact.

The profile image you select for your chatbot plays a big role, together with the script, language and wording of your chatbot.

The most engaging profile image for your chatbot will be one with a persona, a face. This face should have a name, and also a way of speaking, a vocabulary which is consistent and relevant to the persona you want to establish. This is crucial, as this persona will grow, and in time be your most valuable employee, albeit a digital one.

A persona will grow in use, over multiple channels, in scope and functionality. Hence the importance of this foundation.

7️⃣ Named Entities

Without any training, we as humans can understand and detect general entities like Amazon, Apple, South America etc.

But first, what is an entity?

Entities are the information in the user input that is relevant to the user’s intentions.

Intents can be seen as verbs (the action a user wants to execute), while entities represent nouns (for example: the city, the date, the time, the brand, the product). Consider this: when the intent is to get a weather forecast, the relevant location and date entities are required before the application can return an accurate forecast.

Recognizing entities in the user’s input helps you to craft more useful, targeted responses. For example, you might have a #buy_something intent. When a user makes a request that triggers the #buy_something intent, the assistant’s response should reflect an understanding of what the something is that the customer wants to buy. You can add a product entity, and then use it to extract information from the user input about the product the customer is interested in.
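As a small illustration of pairing the verb-like intent with the noun-like entity, the Python sketch below only completes the response once the product entity has been extracted, and slot-fills otherwise. The NLU output shape is an assumption; substitute your framework’s result object.

# Sketch: combining an intent (#buy_something) with an extracted product entity.
# The NLU output shape here is an illustrative assumption.

def respond(nlu_output):
    intent = nlu_output.get("intent")
    entities = {e["entity"]: e["value"] for e in nlu_output.get("entities", [])}

    if intent == "buy_something":
        product = entities.get("product")
        if product:
            return f"Great, let's get you a {product}. Which size would you like?"
        # Entity missing: ask for it (slot filling) instead of guessing.
        return "Sure, what would you like to buy?"
    return "I'm not sure I follow."

print(respond({"intent": "buy_something",
               "entities": [{"entity": "product", "value": "backpack"}]}))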

In NLP, a named entity is a real-world object, such as people, places, companies, products etc.

These named entities can be abstract or have a physical existence. Below are examples of named entities being detected by Riva NLU.

Named Entities code block in the Jupyter Notebook

Example Input:

Jensen Huang is the CEO of NVIDIA Corporation, located in Santa Clara, California.

Example Output:

Named Entities:
jensen huang (PER)
nvidia corporation (ORG)
santa clara (LOC)
california (LOC)
Extracting Named Entities from a Sentence

For instance, spaCy has a very efficient entity detection system which also assigns labels. The default model identifies a host of named and numeric entities. This can include places, companies, products and the like.

  • Text: The original entity text.
  • Start: Index of start of entity in the doc
  • End: Index of end of entity in the doc
  • Label: Entity label, i.e. type
Detail On Each Named Entity Detected
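A minimal spaCy sketch covering the four attributes listed above; it assumes the small English model has been installed (pip install spacy, then python -m spacy download en_core_web_sm).

# Minimal spaCy sketch: extract named entities with text, character offsets and label.
# Assumes the en_core_web_sm model has been downloaded.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jensen Huang is the CEO of NVIDIA Corporation, "
          "located in Santa Clara, California.")

for ent in doc.ents:
    # ent.text                        -> the original entity text
    # ent.start_char / ent.end_char   -> character offsets of the entity in the doc
    # ent.label_                      -> entity type, e.g. PERSON, ORG, GPE
    print(ent.text, ent.start_char, ent.end_char, ent.label_)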

There are named entities we all expect to be common knowledge; these should also be common knowledge to your chatbot. The ideal is if these are included in your NLU model out of the box.

8️⃣ Mixed Modality & Conversational Components

Will we see much of this with Web 3.0?

What is the Web 3.0? It is the third generation of the Internet…a network which is underpinned by intelligent interfaces and interactions.

The Web 3.0 will be constituted by software, with the browser acting as the interface or access medium.

An Amazon Echo Show skill integrated with the Mercedes-Benz API for vehicle specifications and images. The user can ask questions regarding Mercedes vehicles and a topical interactive display is rendered, which can be viewed, listened to and navigated by touch. Follow-up speech commands can also be issued.


Most probably the closest comparison currently to the Web 3.0 are devices like the Amazon Echo Show or Google Nest Hub.

Where we see multiple modalities combined into one user experience.

The user issues speech input and the display renders a user interface with images, text and/or video. The content can be viewed, listened to, with touch or speech navigation.

This multi-modal approach lowers the cognitive load, as user input will primarily be via voice and not typing. Media is presented to the user via text, images/video and speech. Refining queries based on what is presented will most probably result in touch navigation.

Hence we will see full renderings of contextual content based on the spoken user intent.

A big advantage of Web 3.0 is that a very small team can make significant breakthroughs, seeing that it is software-based.

Amongst other key elements, the Web 3.0 will be defined by personalized bots which serve users in specific ways.

A demonstration of available templates and layouts of an Amazon Echo Show skill. Click navigation, with display options, audio or follow-up questions.

These bots will facilitate intelligent interactions with the user and all relevant devices.

Bots will interact via voice, text and contextual data. Focusing on customer service, support, informational, sales, recommendations and more.

Intelligent conversational bots will not only communicate in text but any appropriate media.

This new iteration of the web will have pervasive bots which will surface in various ways and linked to the context of the user’s interaction.

Imagine a user is reading through your website, and at a particular point they can click on text which takes them to a conversational interface which is contextually linked to where the user clicked.

Another speculative illustration of the Web 3.0, with a speech interface to issue commands to the Mercedes-Benz vehicle API. The display changes based on speech input.


The example above is an integration of an Alexa skill with the Mercedes-Benz vehicle API.

9️⃣ Adding Context & Structure To Entities

Compound & Contextual Entities

Huge strides have been made in this area and many chatbot ecosystems accommodate these.

Contextual Entities

The process of annotating user utterances is a way of identifying entities by their context within a sentence.

Contextual Entity Annotation In IBM Watson Assistant

Often entities have a finite set of values which are defined. Then there are entities which cannot be represented by a finite list; like cities in the world or names, or addresses. These entity types have too many variations to be listed individually.

For these entities, you must use annotations; entities defined by their contextual use. The entities are defined and detected via their context within the user utterance.

Compound Entities

The basic premise is that users will utter multiple entities in one sentence.

Users will most probably express multiple entities within one utterance; referred to as compound entities.

In the example below, there are four entities defined:

  • travel_mode
  • from_city
  • to_city
  • date_time
Rasa: Extract of NLU.md File In Rasa Project

These entities can be detected within the first pass and confirmation solicited from the user.
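A hedged Python sketch of this single-pass handling of compound entities follows: all four entities from the list above are extracted from one utterance, and the bot either confirms or prompts only for what is missing. The NLU result shape and the example values are illustrative assumptions.

# Sketch: handling compound entities (travel_mode, from_city, to_city, date_time)
# captured from a single utterance. The NLU result shape is an illustrative assumption.

REQUIRED = ["travel_mode", "from_city", "to_city", "date_time"]

def confirm_or_prompt(nlu_entities):
    found = {e["entity"]: e["value"] for e in nlu_entities}
    missing = [name for name in REQUIRED if name not in found]

    if missing:
        # Ask only for what was not captured in the first pass.
        return f"Could you tell me your {missing[0].replace('_', ' ')}?"

    return (f"So that's a {found['travel_mode']} trip from {found['from_city']} "
            f"to {found['to_city']} on {found['date_time']}. Is that correct?")

utterance_entities = [
    {"entity": "travel_mode", "value": "flight"},
    {"entity": "from_city", "value": "Cape Town"},
    {"entity": "to_city", "value": "Johannesburg"},
    {"entity": "date_time", "value": "tomorrow"},
]
print(confirm_or_prompt(utterance_entities))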

Let’s have a look at more complex entity structures and how they are implemented in Rasa, Microsoft LUIS and Amazon Alexa…

Microsoft LUIS

Decomposition

Machine Learned Entities were introduced to LUIS in November 2019. Entity decomposition is important both for intent prediction and for data extraction with the entity.

We start by defining a single entity, called

  • Travel Detail.

Within this entity, we defined three sub-entities. You can think of this as nested entities or sub-types. The three sub-types defined are:

  • Time Frame
  • Mode
  • City

From here, we have a sub-sub-type for City:

  • From City
  • To City
Defining an Entity With Sub-Types which can be Decomposed

This might sound confusing, but the process is extremely intuitive and allows for the natural expansion of conversational elements.

Data is presented in an easily understandable format. Managing your conversational environment will be easier than previously.

Adding Sub-Entities: ML Entity Composed of Smaller Sub-Entities

Now we can go back to our intent and annotate a new utterance. Only the From City still needs to be defined.

Annotating Utterance Example with Entity Elements

Here are the intent examples, used to train the model with the entity, sub-types, and sub-sub-types; fully contextualized.

Annotated Intent Examples
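To show why decomposition pays off downstream, here is a hedged Python sketch of consuming a decomposed Travel Detail entity; the nested dictionary is an illustrative stand-in and not LUIS’s actual response schema.

# Sketch: consuming a decomposed "Travel Detail" entity with sub-types and sub-sub-types.
# The nested structure is an illustrative stand-in, not LUIS's actual response schema.

travel_detail = {
    "Travel Detail": {
        "Time Frame": "next Friday",
        "Mode": "train",
        "City": {
            "From City": "Pretoria",
            "To City": "Durban",
        },
    }
}

def flatten(entity, prefix=""):
    """Flatten nested sub-entities into dotted keys for easy downstream use."""
    flat = {}
    for key, value in entity.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

print(flatten(travel_detail))
# -> {'Travel Detail.Time Frame': 'next Friday', 'Travel Detail.Mode': 'train',
#     'Travel Detail.City.From City': 'Pretoria', 'Travel Detail.City.To City': 'Durban'}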

🔟 Variation

A lack of variation makes the interaction feel monotonous or robotic. It might take some programmatic effort to introduce variation, but it is important.

A restaurant review is created from a few key words and the restaurant name.

Many development frameworks have functionality which allows you to easily randomize your bot’s output. Or at least have a sequence of utterances which breaks any monotony.
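If your framework does not offer this out of the box, variation is straightforward to add yourself. A minimal Python sketch with assumed greeting texts:

# Minimal sketch: randomising bot responses to avoid monotony.
# The greeting variants are illustrative assumptions.
import random

GREETINGS = [
    "Hi there! How can I help today?",
    "Hello! What can I do for you?",
    "Good to see you. What do you need help with?",
]

def greet(last_used=None):
    # Avoid repeating the variant used in the previous turn.
    options = [g for g in GREETINGS if g != last_used] or GREETINGS
    return random.choice(options)

print(greet())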

An apple pie review based on four generic words.

Conclusion

The impediment most often faced when implementing these ten elements of human conversation is not design considerations, but technical encumbrance. Hence the decision of which platform to use becomes all the more crucial.


Cobus Greyling

I explore and write about all things at the intersection of AI & language; LLMs/NLP/NLU, Chat/Voicebots, CCAI. www.cobusgreyling.com