Chatbots Should Be An Abstraction Of Human Conversation
10 Elements Of A Conceptual Process To Derive General Conversational Rules & Concepts
Introduction
When creating or rather crafting a chatbot conversation we as designers must draw inspiration and guidance from real-world conversations.
Elements of human conversation should be identified and abstracted to be incorporated in our chatbot conversation.
General rules and concepts of human conversations must be derived and implemented via technically astute means.
Below I list 10 elements of human conversation which can be incorporated in a Conversational AI interface. Conversational designers want users to speak to their chatbot as they would to a human…hence it is time for the chatbot to converse in a more human-like way.
Christoph Niemann has fascinating ideas on abstraction and when visual design becomes too abstract.
1️⃣ Digression
Digression is a common and natural part of most conversations…
The speaker introduces a topic, then introduces a second topic, another story that seems to be unrelated.
And then returns to the original topic.
Digression can also be explained in the following way… a user is in the middle of a dialog, also referred to as a customer journey, topic or user story.
The dialog is designed to achieve a single goal, but the user decides to abruptly switch the topic and initiate a dialog flow that is designed to address a different goal.
Hence the user wants to jump midstream from one journey or story to another.
This is usually not possible within a Chatbot, and once a user has committed to a journey or topic, they have to see it through. Normally the dialog does not support this ability for a user to change subjects.
Often an attempt to digress by the user ends in an “I am sorry” from the chatbot and breaks the current journey.
Hence the chatbot framework you are using should allow for digression, where users pop out of and back into a conversation.
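One way to picture digression support is a stack of journeys, where a new topic suspends the current one instead of discarding it. The following is only a minimal sketch under that assumption; the `DialogStack` class and the journey names are hypothetical and not taken from any particular framework.

```python
class DialogStack:
    """Keeps suspended journeys so a user can digress and later resume."""

    def __init__(self):
        self._stack = []  # the last item is the active journey

    @property
    def active(self):
        # The journey the user is currently in, or None if there is none.
        return self._stack[-1] if self._stack else None

    def start(self, journey):
        # Starting a journey suspends the current one instead of dropping it.
        self._stack.append(journey)

    def finish(self):
        # Complete the active journey and return the one to resume, if any.
        if self._stack:
            self._stack.pop()
        return self.active


stack = DialogStack()
stack.start("book_flight")        # the user begins the main journey
stack.start("check_baggage_fee")  # the user digresses midstream
resumed = stack.finish()          # digression handled, main journey resumes
print(resumed)
```

Once the digression is finished, the bot can pick up the original journey where the user left it rather than answering with an “I am sorry”.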
The easy approach is to structure the conversation very rigidly from the chatbot’s perspective and funnel the user in and out of the conversational interface. This might even present very favorably in reporting, but the user experience is appalling.
Overly structuring the conversation breaks the beauty of a conversational interface. Unstructured conversational interfaces are hard to craft but make for an exceptional user experience.
One of the reasons is that users are so used to having to structure their input that they want to enjoy and exercise the freedom of speech (spoken or text), which can lead to disappointment if the expectation of freedom is not met.
2️⃣ Disambiguation
By now we all know that the prime aim of a chatbot is to act as a conversational interface, simulating the conversations we have as humans…
Unfortunately you will find that many of the basic elements of human conversation are not introduced to most chatbots.
A good example of this as we have seen is digression 👆🏻…and another is disambiguation. Often throughout a conversation we as humans will invariably and intuitively detect ambiguity.
Ambiguity is when we hear something which is open to more than one interpretation. Instead of just going off on a tangent which was not intended by the utterance, I perform the act of disambiguation by asking a follow-up question.
This is, simply put, removing ambiguity from a statement or dialog.
Ambiguity makes sentences confusing. For example, “I saw my friend John with binoculars”. Does this mean John was carrying a pair of binoculars? Or that I could only see John by using a pair of binoculars?
Hence, I need to perform disambiguation, and ask for clarification. A chatbot encounters the same issue, where the user’s utterance is ambiguous and instead of the chatbot going off on one assumed intent, it could ask the user to clarify their input. The chatbot can present a few options based on a certain context; this can be used by the user to select and confirm the most appropriate option.
Just to illustrate how effective we as humans are at disambiguating and detecting subtle nuances, have a look at the following two sentences:
- A drop of water on my mobile phone.
- I drop my mobile phone in the water.
These two sentences have vastly different meanings, and compared to each other there is no real ambiguity, but for a conversational interface this will be hard to detect and separate.
Disambiguation allows the chatbot to request clarification from the user. A list of related options should be presented to the user, allowing the user to disambiguate the dialog by selecting an option from the list.
But, the list presented should be relevant to the context of the utterance; hence only contextual options must be presented.
Disambiguation enables chatbots to request help from the user when more than one dialog node might apply to the user’s query.
Instead of assigning the best-guess intent to the user’s input, the chatbot can create a collection of top nodes and present them. In this case, when there is ambiguity, the decision is deferred to the user.
What is really a win-win situation is when the feedback from the user can be used to improve your NLU model; as this is invaluable training data vetted by the user.
Disambiguation can be triggered when the confidence scores of the runner-up intents, that are detected in the user input, are close in value to the top intent.
Hence there is no clear separation and certainty.
There should of course be a “none of the above” option. If a user selects this, a real-time live agent handover can be performed, or a call-back can be scheduled. Or, a broader set of options can be presented.
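The trigger described above can be sketched in a few lines. This is only an illustration under stated assumptions: the intent names, the `margin` of 0.1 and the `max_options` limit are hypothetical values, and a real platform would expose its own settings for this.

```python
def disambiguation_options(scores, margin=0.1, max_options=3):
    """Return the options to show, or an empty list if the top intent is clear.

    scores: dict mapping intent name to a confidence between 0.0 and 1.0.
    margin: how close a runner-up must be to the top score to count as ambiguous.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_score = ranked[0][1]
    # Keep the top intent plus every runner-up within the margin.
    close = [intent for intent, score in ranked if top_score - score <= margin]
    if len(close) < 2:
        return []  # clear winner, no need to ask the user
    # Always let the user escape the list.
    return close[:max_options] + ["none_of_the_above"]


ambiguous = {"water_damage_claim": 0.48, "drop_damage_claim": 0.45, "billing": 0.05}
clear = {"billing": 0.90, "greeting": 0.05}
print(disambiguation_options(ambiguous))
print(disambiguation_options(clear))
```

The choice the user makes from such a list is exactly the vetted feedback that can later be fed back into the NLU training data.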
3️⃣ Auto Learning
As a human concierge or receptionist will learn over time and improve in their job, a chatbot should also learn over time and improve. Learning should take place automatically.
Here is a practical example of achieving this…
The ideal chatbot conversation is just that, conversation-like. Natural language is highly unstructured. When the conversation is not gaining traction, it does make sense to introduce a form of structure.
This form of structure is ideally:
- A short menu of 3 to 4 items presented to the user.
- With Menu Items contextually linked to the context of the last dialog.
- Acting to disambiguate the general context.
- And an option for the user to establish an undetected context.
Once the context is confirmed by the user, the structure can be removed from the conversation, which can then continue unstructured in natural language.
The brief introduction of structure is merely there as a mechanism to further the dialog. This serves as a remedy against fallback proliferation.
The idea behind autolearning is to order these disambiguation menus according to use or user popularity.
A practical example:
When a customer asks a question that the assistant isn’t sure it understands, the assistant often shows a list of topics to the customer and asks the customer to choose the right one.
This process is called disambiguation.
If, when a similar list of options is shown, customers most often click the same option (option #2, for example), then your skill can learn from that experience.
It can learn that option #2 is the best answer to that type of question. And next time, it can list option #2 as the first choice, so customers can get to it more quickly.
And, if the pattern persists over time, it can change its behavior even more. Instead of making the customer choose from a list of options at all, it can return option #2 as the answer immediately.
The premise of this feature is to improve the disambiguation process over time to such an extent that eventually the correct option is presented to the user automatically. Hence the chatbot learns how to disambiguate on behalf of the user.
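The idea can be sketched with a simple click counter. This is not how any specific product implements autolearning; the `DisambiguationLearner` class, the 80% share and the minimum click count are assumptions chosen purely for illustration.

```python
from collections import Counter


class DisambiguationLearner:
    """Reorders disambiguation options by how often users pick them."""

    def __init__(self, auto_select_share=0.8, min_clicks=20):
        self.clicks = Counter()
        self.auto_select_share = auto_select_share  # share needed to skip the menu
        self.min_clicks = min_clicks                # evidence needed before skipping

    def record_click(self, option):
        self.clicks[option] += 1

    def order(self, options):
        # Most-clicked first; sorted() is stable, so ties keep their original order.
        return sorted(options, key=lambda option: -self.clicks[option])

    def auto_answer(self, options):
        # Return an option to answer with directly, or None to keep showing the menu.
        total = sum(self.clicks[option] for option in options)
        if total < self.min_clicks:
            return None
        best = self.order(options)[0]
        share = self.clicks[best] / total
        return best if share >= self.auto_select_share else None


learner = DisambiguationLearner(min_clicks=10)
menu = ["option_1", "option_2", "option_3"]
for _ in range(3):
    learner.record_click("option_2")
print(learner.order(menu))        # option_2 moves to the front
print(learner.auto_answer(menu))  # still None, not enough evidence yet
```

Keeping a minimum click count before skipping the menu is a deliberate choice here, so that a handful of early clicks does not lock in a wrong answer.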
4️⃣ Domain & Irrelevance
A service agent is not trained to answer questions which are irrelevant and outside the domain of the organization.
How do you develop for user input which is not relevant to your design…
In general chatbots are designed and developed for a specific domain. These domains are narrow and applicable to the concern of the organization they serve. Hence chatbots are custom and purpose built as an extension of the organization’s operation, usually to allow customers to self-service.
As an added element to make the chatbot more interactive and lifelike, and to anthropomorphize the interface, small talk is introduced. Also referred to as chitchat.
But what happens if a user utterance falls outside this narrow domain? With most implementations the highest scoring intent is assigned to the user’s utterance, in a frantic attempt to field the query.
Negate False Intent Assignment
So, instead of stating the intent is out of scope, in a desperate attempt to handle the user utterance, the chatbot assigns the best-fit intent to the user’s input, which is often wrong.
Alternatively the chatbot continues to inform the user it does not understand, leaving the user to rephrase the input over and over. Rather have the chatbot merely state the question is not part of its domain.
A handy design element is to have two or three sentences serve as an intro for first-time users; sketching the gist of the chatbot domain.
The traditional approaches are:
- Many “out-of-scope” examples are dreamed up and entered, which is hardly ever successful.
- Attempts are made to disambiguate the user input.
But actually, the chatbot should merely state that the query is outside of its domain and give the user guidance.
OOD & ID
So, user input can broadly be divided into two groups, In-Domain (ID) and Out-Of-Domain (OOD) inputs. ID inputs are where you can attach the user’s input to an intent based on existing training data. OOD detection refers to the process of tagging data which does not match any label (intent) in the training set.
Traditionally OOD training requires large amounts of training data, hence OOD detection does not perform well in current chatbot environments.
An advantage of most chatbot development environments is that they need only a very limited amount of training data; perhaps 15 to 20 utterance examples per intent.
We don’t want developers spending vast amounts of time on an element not part of the bot’s core.
The challenge is that as a developer, you need to provide training data and examples. The OOD or irrelevant input is possibly an infinite number of scenarios, as there is no boundary defining irrelevance.
The ideal is to build a model that can detect OOD inputs with a very limited set of data defining the intent; or no OOD training data at all.
The second option being the ideal…
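To make the second option concrete, here is a toy sketch that flags OOD input using only in-domain examples. It swaps a real NLU model for plain word overlap (Jaccard similarity), so it illustrates the principle only; the intents, example utterances and the 0.3 threshold are all hypothetical.

```python
import re


def tokens(text):
    # Lower-cased set of words in the text.
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    # Share of words the two sets have in common.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(utterance, examples, threshold=0.3):
    """Return the closest intent, or 'out_of_domain' if nothing is similar enough.

    examples: dict mapping intent name to a list of in-domain example utterances.
    No out-of-domain training examples are needed.
    """
    words = tokens(utterance)
    best_intent, best_score = "out_of_domain", 0.0
    for intent, samples in examples.items():
        for sample in samples:
            score = jaccard(words, tokens(sample))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else "out_of_domain"


EXAMPLES = {
    "check_balance": ["what is my account balance", "show my balance"],
    "report_lost_card": ["i lost my card", "my card was stolen"],
}
print(classify("what is my balance", EXAMPLES))
print(classify("who won the football match last night", EXAMPLES))
```

When the result is `out_of_domain`, the bot can state that the query falls outside its domain and offer guidance, instead of forcing the best-fit intent.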
5️⃣ Compound Intents
In human conversation, if a request is too long, we break it up into smaller portions and address each subject, or intent separately.
In short the problem is…the user input is too long, with multiple requests in one sentence or utterance.
In essence compound intents…
The medium impacts the message, and in some mediums, like SMS/text and messaging applications in general, the user input might be shorter. In mediums accessed via a keyboard or a browser, the user input tends to be longer.
The longer user input can have multiple sentences, with numerous user intents embedded in the text.
There can also be multiple entities. Users don’t always speak in single intent and entity utterances.
On the contrary, the users will speak in compound utterances. The moment these complex user utterances are thrown at the chatbot, the bot needs to play a game of single intent winner.
Which intent from the whole host of intents from the user is going to win this round of dialog turn?
But…what if the chatbot could detect that it just received four sentences? The intent of the first one is the weather tomorrow in Cape Town, the second is the stock price for Apple, the third is an alarm for tomorrow morning, etc.
Too ambitious you might think?
Not at all…very possible, doable and the tools to achieve this exist.
Best of all, many of these tools are opensource and free to use…
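As a rough illustration of the principle, the input can be split into sentences and each sentence assigned its own intent. The sketch below uses a naive regular expression for sentence splitting and hypothetical keyword rules in place of a trained intent classifier, so treat it as a stand-in rather than a recommended toolchain.

```python
import re

# Hypothetical keyword rules standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "get_weather": {"weather", "forecast"},
    "get_stock_price": {"stock", "share"},
    "set_alarm": {"alarm", "wake"},
}


def split_sentences(text):
    # Naive split on whitespace that follows ., ? or !
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]


def detect_intent(sentence):
    # Return the first intent whose keywords appear in the sentence, else None.
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return None


def detect_compound_intents(text):
    # One (sentence, intent) pair per sentence instead of a single winner.
    return [(sentence, detect_intent(sentence)) for sentence in split_sentences(text)]


user_input = (
    "What is the weather tomorrow in Cape Town? "
    "What is the stock price for Apple? "
    "Set an alarm for tomorrow morning."
)
for sentence, intent in detect_compound_intents(user_input):
    print(intent, "<-", sentence)
```

Each detected intent can then be handled in turn, rather than letting a single intent win the dialog turn.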
6️⃣ Anthropomorphize
People respond well to personas, also to a graphic representation of a persona.
We anthropomorphize things by nature; cars, ships, other inanimate objects…and chatbots are certainly no different. User perception of the chatbot definitely affects how they engage and interact.
The profile image you select for your chatbot plays a big role, along with the script, language and wording of your chatbot.
The most engaging profile image for your chatbot will be one with a persona, a face. This face should have a name, and also a way of speaking, a vocabulary which is consistent and relevant to the persona you want to establish. This is crucial, as this persona will grow, and in time be your most valuable employee, albeit a digital one.
A persona will grow in use, over multiple channels, in scope and functionality. Hence the importance of this foundation.
7️⃣ Named Entities
Without any training, we as humans can understand and detect general entities like Amazon, Apple, South America etc.
But first, what is an entity?
Entities are the information in the user input that is relevant to the user’s intentions.
Intents can be seen as verbs (the action a user wants to execute), while entities represent nouns (for example: the city, the date, the time, the brand, the product). Consider this: when the intent is to get a weather forecast, the relevant location and date entities are required before the application can return an accurate forecast.
Recognizing entities in the user’s input helps you to craft more useful, targeted responses. For example, you might have a #buy_something intent. When a user makes a request that triggers the #buy_something intent, the assistant’s response should reflect an understanding of what the something is that the customer wants to buy. You can add a product entity, and then use it to extract information from the user input about the product that the customer is interested in.
For instance, spaCy has a very efficient entity detection system which also assigns labels. The default model identifies a host of named and numeric entities. This can include places, companies, products and the like.
- Text: The original entity text.
- Start: Index of start of entity in the doc
- End: Index of end of entity in the doc
- Label: Entity label, i.e. type
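To show what these four fields look like in practice without depending on an installed model, here is a small stand-in that uses a lookup list (a gazetteer) instead of spaCy. The `Entity` tuple, the `GAZETTEER` contents and the label names are assumptions for illustration; a statistical model detects far more than a fixed list can.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    text: str   # the original entity text
    start: int  # character index where the entity starts
    end: int    # character index just past the entity
    label: str  # entity type


# Hypothetical lookup list standing in for a trained entity recognizer.
GAZETTEER = {
    "Amazon": "ORG",
    "Apple": "ORG",
    "South America": "LOC",
}


def find_entities(text):
    entities = []
    for name, label in GAZETTEER.items():
        # Whole-word, case-sensitive matches only.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            entities.append(Entity(match.group(), match.start(), match.end(), label))
    return sorted(entities, key=lambda entity: entity.start)


sentence = "Apple is opening an office in South America"
for entity in find_entities(sentence):
    print(entity.text, entity.start, entity.end, entity.label)
```

The start and end offsets make it possible to slice the entity back out of the original text, which is handy when highlighting or replacing it in a response.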
There are named entities we all expect to be common knowledge; these should also be common knowledge to your chatbot. Ideally these are included in your NLU model out-of-the-box.
8️⃣ Mixed Modality & Conversational Components
Human-to-human conversation takes place via various modalities, mixing modalities in one conversation and utilizing various conversational components.
Firstly we need to think of any conversation, be it in voice or text, as being presented to the user by means of conversational components…
Conversational Components are the different elements (or affordances if you like) available within a specific medium.
For instance, within Facebook Messenger there are buttons, carousels, menus, quick replies and the like.
The developer can leverage these conversational components to the maximum, hence negating any notion of a pure natural language conversational interface.
It is merely a graphic interface living within a conversational medium. Or, if you like, a graphic user interface made available within a conversational container.
Is this wrong…no.
It is taking the interface to the user’s medium of choice. Hence making it more accessible and removing the friction apps present.
The problem with such an approach is that you are heavily dependent on the affordances of the medium.
Should you want to roll your chatbot out on WhatsApp, the same conversational components will obviously not be available, and you will have to lean more on NLU, keyword spotting or a menu-driven solution. With an even more basic medium like SMS you are really dependent on NLU or a number/keyword-driven menu.
In that case each menu item is a keyword or number which the user needs to input to navigate the UI.
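The dependence on the medium can be sketched as one menu rendered two ways. The channel names and the shape of the button payload below are hypothetical and do not follow any real messaging API; the point is only that the same options degrade to a numbered list on a text-only medium.

```python
def render_menu(prompt, options, channel):
    """Render the same menu for channels with different affordances."""
    if channel == "messenger":
        # Rich medium: structured payload with buttons (hypothetical shape).
        buttons = [{"title": option, "payload": option} for option in options]
        return {"text": prompt, "buttons": buttons}
    # Text-only medium: a numbered list the user answers by typing.
    lines = [prompt] + [f"{i}. {option}" for i, option in enumerate(options, start=1)]
    return "\n".join(lines)


def resolve_reply(reply, options):
    """Map a typed number or keyword back to a menu option, or None."""
    reply = reply.strip()
    if reply.isdecimal() and 1 <= int(reply) <= len(options):
        return options[int(reply) - 1]
    for option in options:
        if option.lower() == reply.lower():
            return option
    return None


menu_options = ["Balance", "Lost card"]
print(render_menu("How can I help?", menu_options, "sms"))
print(resolve_reply("2", menu_options))
```

On the rich channel the user taps a button, while on SMS the same choice arrives as typed text that has to be mapped back to an option.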
Is it a chatbot?
Technically, yes; as it lives in a multi-threaded asynchronous conversational medium.
But is it conversational in nature? One would have to argue no.
So in the second example a rudimentary approach needs to be followed for one single reason: the medium does not afford the rich functionality and is primarily text based.
With Voice, these components disappear completely. With only the natural language component available within the Voice User Interface (VUI) environment, there are no components to leverage and the affordances are invisible.
9️⃣ Contextual Entities
Detecting entities which are embedded within user utterances remains a challenge, especially should you want to capture the entities via unstructured methods and in a truly conversational manner.
As humans we generally do not have any trouble detecting entities within their context…neither should a Conversational Interface.
Intents can be seen as verbs, the intention of the user. You can think of Google Search as the biggest intent detection machine in the world.
Entities can be seen as nouns. Should a user say: I am taking the train from Paris to Lisbon…then the entities are: train, Paris & Lisbon.
And this is where Microsoft’s LUIS is really the leader in defining entities contextually.
Of course rudimentary methods can be employed to extract entities from one sentence or more…these include:
- Prompt the user for each entity individually, one after the other, regardless of whether the user has already said it or not.
- Use word spotting or regular expressions to spot or extract specific words. As data grows, this becomes increasingly infeasible.
The process of annotating is a way of identifying entities by their context within a sentence.
Often entities have a finite set of values which are defined. Then there are entities which cannot be represented by a finite list; like cities in the world or names, or addresses. These entity types have too many variations to be listed individually.
For these entities, you must use annotations; entities defined by their contextual use.
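The principle of defining an entity by its context rather than by a list of values can be shown with a deliberately rudimentary sketch. Here the words “from” and “to” decide the role of each place name; in practice a model trained on annotated examples would take the place of this regular expression, and the slot names `origin` and `destination` are assumptions.

```python
import re

# A capitalized name of one or more words, e.g. "Paris" or "Cape Town".
PLACE = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*"

# The surrounding words ("from", "to") give each place its role.
ROUTE_PATTERN = re.compile(
    r"\bfrom\s+(?P<origin>" + PLACE + r")\s+to\s+(?P<destination>" + PLACE + r")"
)


def extract_route(utterance):
    # Return the contextual entities found, or an empty dict.
    match = ROUTE_PATTERN.search(utterance)
    return match.groupdict() if match else {}


def missing_slots(utterance, required=("origin", "destination")):
    # Only the slots the user has not already supplied need a prompt.
    found = extract_route(utterance)
    return [slot for slot in required if slot not in found]


print(extract_route("I am taking the train from Paris to Lisbon"))
print(missing_slots("I want to book a train"))
```

No list of cities is needed: any capitalized name in the right position is picked up, and the bot only prompts for what is still missing.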
🔟 Variation
A lack of variation makes the interaction feel monotonous or robotic. It might take some programming effort to introduce variation, but it is important.
Many development frameworks have functionality which allows you to easily randomize your bot’s output. Or at least have a sequence of utterances which breaks any monotony.
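Where a framework lacks this, the behavior is simple to sketch. The hypothetical `VariedResponder` below picks a random variant but never repeats the one it used last; the wording of the variants is made up for the example.

```python
import random


class VariedResponder:
    """Picks a response variant at random, never repeating the previous one."""

    def __init__(self, variants, rng=None):
        if len(set(variants)) < 2:
            raise ValueError("need at least two distinct variants")
        self.variants = list(variants)
        self.rng = rng or random.Random()
        self.last = None

    def next(self):
        # Exclude the variant used last time so consecutive replies differ.
        choices = [variant for variant in self.variants if variant != self.last]
        self.last = self.rng.choice(choices)
        return self.last


greeting = VariedResponder([
    "Hi there, how can I help?",
    "Hello! What can I do for you?",
    "Good to see you. What do you need?",
])
print(greeting.next())
print(greeting.next())
```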
Conclusion
The impediment most often faced when implementing these ten elements of human conversation is not design considerations, but technical encumbrance. Hence the decision of which platform to use becomes all the more crucial.