Chatbot Architecture Overview
Whenever I see a chatbot implementation, or hear someone speak about their chatbot solution, I always wonder what lies under the hood. Sometimes I find a chatbot design is underpinned by an astute framework and solid development effort. Other times the underlying technology is scant and rudimentary…
Here are a few basic design principles being implemented currently.
The Impostor, The Standard and The Visionary.
What is Your Definition of a Chatbot?
One way to approach this question is to say a chatbot is in essence a conversational interface, made possible by “understanding” natural conversation.
Or you might want to describe a chatbot as any interface accepting unstructured text input, which then needs to be structured on the application side so a particular action can be taken. In other words, it is defined by the absence of structured input, where the user would otherwise navigate via visual elements.
The other approach is to describe a chatbot as an interface that exists in any multi-threaded asynchronous messaging environment. So, should your chatbot exist within Facebook Messenger, or WhatsApp for Business, well, then it is a “chatbot”. Even though many of these chatbots are menu and/or button driven.
Hence they rely heavily on Conversational Components.
Conversational Components are the different elements available within a medium.
Within Facebook Messenger there are buttons, carousels, menus, quick replies and the like. This allows the developer to leverage these conversational components to the maximum, hence negating any notion of a pure natural language conversational interface. It is merely a graphic interface living within a conversational medium. Or, if you like, a graphic user interface made available within a conversational container.
Is this wrong? No. It takes the interface to the user’s medium of choice, hence making it more accessible and removing the friction apps present.
The problem with such an approach is that you are heavily dependent on the affordances of the medium. Should you want to roll your chatbot out on WhatsApp, the same conversational components will not be available, and you will have to opt for NLU, keyword spotting or a menu-driven solution, where the menu is constituted by keywords or numbers, as shown below.
The menu will have to be a keyword or number the user needs to input to navigate the UI. Is it a chatbot? Technically, yes, as it lives in a multi-threaded asynchronous conversational medium.
But is it conversational in nature? One would have to argue no. So in the second example a rudimentary approach needs to be followed for one single reason: the medium does not afford the rich functionality and is primarily text based.
So how can sense be made of the natural language input? There are impostor chatbots where the user input is searched and “word spotting” is employed to find a match. This match is then linked to a state machine, aka fixed logic, aka dialog flow. So a particular match of a word or sequence of words takes the user to a particular point in the flow.
So you can imagine how well this works as long as the happy path is taken during testing, and the chatbot appears to handle all the natural language unstructured input. However, this solution cannot scale, vertically or horizontally. Language detection is not feasible. This implementation will fail the moment dialogs and user input become longer: multiple sentences and multiple intents will be lost, not to mention contextual entities.
Capturing of entities and contextual entities is beyond the reach of this solution. Form filling can only be accomplished via a very rigid, fixed dialog, with the expected data arriving at the right time in the right format.
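The impostor pattern described above can be sketched in a few lines. This is a minimal illustration, with hypothetical keywords, states and responses, not a reference to any particular framework:

```python
# Minimal "impostor" chatbot: keyword spotting wired to a fixed state machine.
# All states, keywords and responses here are hypothetical illustrations.

# Each state maps spotted keywords to the next state in the fixed flow.
FLOW = {
    "start":      {"balance": "show_balance", "pay": "ask_amount"},
    "ask_amount": {"cancel": "start"},
}

RESPONSES = {
    "start":        "Hi! You can ask about your 'balance' or say 'pay'.",
    "show_balance": "Your balance is 100.",  # placeholder data
    "ask_amount":   "How much would you like to pay? (or 'cancel')",
}

def step(state: str, user_input: str) -> str:
    """Spot the first known keyword and jump to the mapped state."""
    for word in user_input.lower().split():
        if word in FLOW.get(state, {}):
            return FLOW[state][word]
    return state  # nothing matched: stay put, a common failure mode

state = step("start", "what is my balance please")
print(RESPONSES[state])
```

Note how the design fails exactly as described: any phrasing that does not contain one of the hard-coded keywords leaves the user stuck in the current state.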
The standard and most common approach is to have an NLU (Natural Language Understanding) component. The NLU can be broken up into intents and entities. Intent detection merely identifies the intent, or intention, of the user within that particular dialog.
This is much easier in theory than in practice. Usually a group of intents is defined, and for each intent a set of example inputs is given. The NLU solution then creates a model based on the defined set of intents and the example inputs for each.
The process of intent classification is the basis for allowing a user to use natural language input.
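To make the idea concrete, here is a toy intent classifier that scores user input against example utterances per intent by token overlap. The intent names and examples are made up for illustration; a real NLU product trains a statistical model rather than counting shared words:

```python
import re

# Toy intent classifier: example utterances per intent (illustrative only).
EXAMPLES = {
    "greeting":      ["hello there", "good morning", "hi"],
    "check_weather": ["what is the weather", "will it rain today"],
    "pay_bill":      ["pay my bill", "process a bill payment"],
}

def tokenize(text: str) -> set:
    """Lowercase and strip punctuation into a set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(utterance: str) -> str:
    """Return the intent whose best example shares the most tokens."""
    tokens = tokenize(utterance)
    def score(intent):
        return max(len(tokens & tokenize(ex)) for ex in EXAMPLES[intent])
    return max(EXAMPLES, key=score)

print(classify("Hi, will it rain?"))
```

Even this crude version shows why example coverage matters: an utterance phrased unlike any training example will land on an arbitrary intent.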
So intents are purposes or goals that are expressed in a customer’s input, such as answering a question or processing a bill payment. By recognizing the intent expressed in a customer’s input, the Watson Assistant service can choose the correct dialog flow for responding to it.
Entities represent information in the user input that is relevant to the user’s purpose.
If intents represent verbs (the action a user wants to do), entities represent nouns (the object of, or the context for, that action). For example, when the intent is to get a weather forecast, the relevant location and date entities are required before the application can return an accurate forecast.
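Continuing the weather-forecast example, a sketch of entity extraction might look like the following. The word lists are illustrative stand-ins for a real entity model:

```python
import re

# Toy entity extraction for a weather-forecast intent: pull a date word and
# a location out of the utterance.  Gazetteers here are illustrative only.
DAYS = {"today", "tomorrow", "monday", "tuesday", "wednesday",
        "thursday", "friday", "saturday", "sunday"}
CITIES = {"london", "paris", "cape town", "new york"}

def extract_entities(utterance: str) -> dict:
    """Return the date and location entities found in the input, if any."""
    text = utterance.lower()
    date = next((d for d in DAYS if re.search(rf"\b{d}\b", text)), None)
    city = next((c for c in CITIES if c in text), None)
    return {"date": date, "location": city}

print(extract_entities("What's the weather in Cape Town tomorrow?"))
```

Only once both entities are present can the application return an accurate forecast; a real dialog manager would prompt for whichever slot is still missing.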
This all ties into a dialog flow, or conversation flow: the detected intents and entities point to particular points within this flow. The flow is also referred to as a state machine, or state management.
This portion is still a very rigid and rule based element of chatbots. This portion also holds the dialog, the text used to communicate with the user.
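A minimal sketch of such a flow, with the dialog text stored alongside each node, could look like this. The intent names, responses and state names are hypothetical:

```python
# Sketch of a rule-based dialog flow keyed on the detected intent.  Each node
# holds the response text and the next state.  All values are illustrative.
DIALOG_FLOW = {
    "greeting":      {"say": "Hello! How can I help?",    "next": "root"},
    "check_weather": {"say": "Which city, and what day?", "next": "await_slots"},
    "goodbye":       {"say": "Thanks for chatting. Bye!", "next": "end"},
}

FALLBACK = {"say": "Sorry, I didn't get that.", "next": "root"}

def respond(intent: str) -> tuple:
    """Look up the node for an intent; fall back when the intent is unknown."""
    node = DIALOG_FLOW.get(intent, FALLBACK)
    return node["say"], node["next"]

say, next_state = respond("check_weather")
print(say, "| next state:", next_state)
```

The rigidity the text describes is visible here: every response and every transition is hand-authored, which is exactly why this portion of a chatbot remains rule based.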
If you look at the example below, you will notice that the intent is always travel related, but there are three entities which need to be captured: day, city and transport mode. These entities are hidden within the intent.
The only way to intelligently extract these entities is for your NLU to be contextually aware when looking at the intent of the dialog.
Your NLU should afford you the possibility to detect entities based on their context within the dialog. This allows for a true natural conversation as you would with a human and for your Conversational AI system to pick out the relevant entities from the user dialog.
The aim is for the NLU to have the conversation as unstructured as possible from the user’s perspective and make the most of the data entered by the user.
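One crude way to illustrate context-aware entity capture for the travel example is to use the preposition preceding a word as the context hint ("to" suggests a city, "by" a transport mode, "on" a day). A real NLU learns such contextual patterns from annotated data; this regex version is only a sketch:

```python
import re

# Sketch of contextual entity capture for a travel intent: the word before an
# entity hints at its role.  The patterns and the example are illustrative.
PATTERNS = {
    "city":      r"\bto\s+([a-z]+)",
    "transport": r"\bby\s+([a-z]+)",
    "day":       r"\bon\s+([a-z]+)",
}

def travel_entities(utterance: str) -> dict:
    """Extract day, city and transport mode using their surrounding context."""
    text = utterance.lower()
    out = {}
    for slot, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        out[slot] = match.group(1) if match else None
    return out

print(travel_entities("Book a trip to Paris by train on Friday"))
```

The user never had to label the values; their position in the sentence disambiguated them, which is the essence of contextual entities.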
This leads to a scenario of surprise and delight when using the interface.
The ideal is where the basic NLU core of the chatbot is fronted by a higher-order NLP layer. This NLP layer can be seen as a process where the user utterances / dialogs are pre-processed and the data is prepared for interpretation by the NLU layer.
Most chatbot implementations are vulnerable due to the NLU layer not being resilient enough. The higher-order NLP layer includes elements like categorization. Category hierarchies, which are fully custom, are implemented at this stage.
User input can be categorized, and custom categories can be created based on the specific use case. Entity types and contextual entities form part of this. Language detection, relations and machine learning models are also implementable in this layer. Relations are particularly helpful in structuring natural language input.
The relation types available differ based on which language you are using. Examples include the link between a Person and the place where they were born; the link between a Person and the Date or Time when they were born; or the link between a Person and another entity, such as a Person or Organization, that he or she manages as a job.
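The categorization and language-detection elements of such a pre-processing layer could be sketched as follows. The category hierarchy and the English-word heuristic are made-up examples, far simpler than what a production NLP service provides:

```python
import re

# Sketch of a higher-order NLP layer that pre-processes an utterance before it
# reaches the NLU core.  Categories and language hints are illustrative only.
CATEGORIES = {
    "billing": {"invoice", "payment", "refund"},
    "travel":  {"flight", "train", "hotel"},
}
ENGLISH_HINTS = {"the", "is", "and", "to", "a", "of"}  # crude heuristic

def preprocess(utterance: str) -> dict:
    """Tokenize, guess the language, and attach custom categories."""
    tokens = re.findall(r"[a-z]+", utterance.lower())
    token_set = set(tokens)
    return {
        "tokens": tokens,
        "looks_english": bool(token_set & ENGLISH_HINTS),
        "categories": [name for name, words in CATEGORIES.items()
                       if token_set & words],
    }

print(preprocess("I need a refund on the invoice for my flight"))
```

The NLU core then receives cleaned tokens plus category and language signals, rather than raw text, which is what makes the overall system more resilient.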
Read more about my work here…