Photo by Usukhbayar Gankhuyag on Unsplash

Bootstrapping A Chatbot With A Large Language Model

How To Harness The Power Of OpenAI In Creating A Chatbot From Scratch

Cobus Greyling
9 min read · Jun 30, 2022


Introduction

In this article I cover a practical approach to creating a chatbot from scratch using Large Language Models. I will illustrate this method with step-by-step examples while leveraging OpenAI and a new concept I like to call intent-documents.

This is an approach I have conceptualised over the last few weeks while focusing on Large Language Models (LLMs). I believe this is a feasible and practical methodology for combining LLM features to create a chatbot by orchestrating clustering, semantic search and natural language generation (NLG).

The unstructured data used for creating the chatbot is a list of facts pertaining to the continent of Africa. Hence the purpose of this chatbot is to answer questions related to the African continent, while maintaining state and contextual awareness.

The conventional approach is NLU training using user utterances. In this approach, the bot utterances (facts) are used to create and train the bot. Hence the process is inverted to some degree, with the bot training performed using bot messages and content.

Recently I wrote about the HumanFirst and Co:here POC integration, which was a good example of how to leverage LLMs for real-world production implementations.

So, here is a practical example of how OpenAI can be leveraged to create a chatbot…

Architecture Overview

My approach is a chatbot implemented as a thin abstraction layer premised on the OpenAI LLM.

Below is a basic sequence diagram describing the steps involved and how these steps relate to each other.

Step 1: Clustering & Intent Detection

The first step is to use OpenAI’s clustering to make sense of large volumes of unstructured text data.

As per the OpenAI documentation:

Clustering is one way of making sense of a large volume of textual data. Similarity embedding is useful for this task, as it is a semantically meaningful vector representation of each text. Thus, in an unsupervised way, clustering will uncover hidden groupings in our dataset.

With unsupervised training, clusters are created of sentences representing related facts. These groupings are given a theme name; these themes are in essence intents, used to create intent-documents.
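The clustering idea can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: the `embed` function is a toy bag-of-words vector standing in for a real similarity embedding (it is not the OpenAI embedding), the k-means routine is a plain textbook version, and the facts are illustrative only, not the data set used for the prototype. All function names are hypothetical.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lower-case a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def embed(text, vocabulary):
    """Toy stand-in for an embedding model: a unit-length bag-of-words vector."""
    counts = Counter(tokenize(text))
    vector = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_facts(facts, k):
    """Group facts into at most k clusters; each cluster is a candidate intent."""
    vocabulary = sorted({word for fact in facts for word in tokenize(fact)})
    labels = kmeans([embed(fact, vocabulary) for fact in facts], k)
    clusters = {}
    for fact, label in zip(facts, labels):
        clusters.setdefault(label, []).append(fact)
    return clusters


# Illustrative facts only, not the data set used in the article.
facts = [
    "Africa is the second largest continent by area.",
    "Africa is the second most populous continent.",
    "The Nile is the longest river in Africa.",
    "The Sahara is the largest hot desert in the world.",
]
clusters = cluster_facts(facts, k=2)
```

Each resulting group still has to be given a theme name by a person; the code only proposes the groupings.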

Currently OpenAI does not cater for clustering within the Playground.

Clustering with OpenAI is detailed in a Notebook example and is a highly technical, Python-based solution. For clustering the data (facts about Africa) it was much easier to revert to HumanFirst to ingest the unstructured and unlabelled data.

Unfortunately, no integration exists between HumanFirst and OpenAI, as there is for Co:here. For the intent-documents it is important to be able to organise the intent data, both for the inception of the bot and for continuous maintenance.

Comparing clustering results of Co:here and OpenAI will be really interesting.

Back to the process…the first step is to take our training data and cluster it into intents. Each cluster, or intent, is named and contains a list of associated facts.

As seen above, within HumanFirst, the first intent is defined, named MeasurementRelativeSize. The list of facts which constitutes this intent is marked with the green block.

This is a quick and easy way to cluster unlabelled facts in an unsupervised fashion.

The intent name and the list of facts are uploaded to OpenAI as an intent-document. These intent-documents will be used for semantic search.
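An intent-document can be pictured as a small record holding an intent name and its facts. The sketch below serialises such records as JSON Lines, one document per line. The `text` and `metadata` field names are an assumption based on the file format OpenAI's file-based search accepted at the time of writing, so check the current documentation before relying on them. The class, the function and the sample fact are hypothetical; only the intent name MeasurementRelativeSize comes from the prototype.

```python
import json
from dataclasses import dataclass, field


@dataclass
class IntentDocument:
    """One intent: a theme name plus the facts clustered under it."""
    name: str
    facts: list = field(default_factory=list)

    def as_text(self):
        """Join the facts into the single text body of the document."""
        return " ".join(self.facts)


def to_jsonl(documents):
    """Serialise intent-documents as JSON Lines (field names are assumed)."""
    return "\n".join(
        json.dumps({"text": doc.as_text(), "metadata": doc.name})
        for doc in documents
    )


# The fact below is illustrative, not taken from the article's data set.
size_intent = IntentDocument(
    name="MeasurementRelativeSize",
    facts=["Africa is the second largest continent by area."],
)
payload = to_jsonl([size_intent])
```

Keeping the intent name in the metadata means the search result can be mapped straight back to the intent it belongs to.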

Step 2: Semantic Search

The next step is to capture and understand the user utterance…

During runtime, when the user sends a message to the chatbot, the utterance is assigned to an intent and the corresponding intent-document is retrieved.

This is done zero-shot, via semantic search. The OpenAI semantic search functionality allows for the user utterance to be matched to a cluster of training utterances (also referred to as a document), which denotes an intent.

Below is an example of the JSON payload returned by OpenAI, referencing the applicable intent-document.

The intent-document approach described here is an efficient way of accessing and retrieving only a small portion of the training data/facts.
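The retrieval step can be sketched as a ranking problem: score the user utterance against every intent-document and keep the best match. In this minimal sketch the scoring is a toy cosine similarity over word counts, standing in for the OpenAI semantic search; it does not reproduce the API or its JSON payload. The function names, the `Rivers` intent and the sample facts are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count the alphabetic tokens of a lower-cased text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(left, right):
    """Cosine similarity between the word counts of two texts."""
    a, b = word_counts(left), word_counts(right)
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_intent(utterance, intent_documents):
    """Return (intent name, score) for the best-matching intent-document."""
    scored = [
        (name, similarity(utterance, " ".join(facts)))
        for name, facts in intent_documents.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Illustrative intent-documents, not the article's data set.
intent_documents = {
    "MeasurementRelativeSize": ["Africa is the second largest continent by area."],
    "Rivers": ["The Nile is the longest river in Africa."],
}
best_intent, best_score = select_intent("How long is the Nile river?", intent_documents)
```

Only the facts under `best_intent` are passed on to the next step, which is what keeps the prompt small.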

There are also other important reasons to organise training data in an automated fashion; hopefully I can expand on this in a later article.

This allows the facts sent to the OpenAI Generation API to be as short and concise as possible, which contributes to efficiency in cost and response times, with very little to no aberrations in the generated text.

You can read more about OpenAI document search and the fine-tuning process here.

Step 3: Generation

Once the correct intent-document is selected by semantic search…

The intent-document together with the entered user message are sent to OpenAI’s generation API.

The OpenAI generator returns an appropriate bot response to be presented to the user. The intent-document is used as few-shot training data; in essence, it is used to create the bot prompt on the fly.

In the image below, the intent-document is marked by the purple block.

The question entered by the user is marked by the green arrow.

The OpenAI generated answer is marked by the red blocks.

You can see how well the NLG is formed by OpenAI and how contextually accurate the generated response is.

Below is the Python code for the playground example shown and discussed above.

Again, it is clear that the intent-documents need to be categorised and formed very accurately, while keeping each intent-document as succinct as possible to make the API calls more efficient and cut down on overhead.
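The on-the-fly prompt creation can be sketched as follows. This is a minimal sketch, not the code from the playground example: it assumes the prompt uses the Facts, Question and Answer labels, and it deliberately leaves the call to the generation API out. The `generate` argument is a hypothetical callable that takes a prompt string and returns the generated text, so any completion client can be wrapped and plugged in.

```python
def build_prompt(facts, question):
    """Combine the intent-document facts and the user question into one prompt."""
    lines = ["Facts:"]
    lines += [f"- {fact}" for fact in facts]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


def answer(facts, question, generate):
    """Ask the generator for a bot response.

    `generate` is an assumed callable (prompt -> text); wrap the real
    generation API call in it.
    """
    return generate(build_prompt(facts, question)).strip()


# Illustrative fact, not taken from the article's data set.
prompt = build_prompt(
    ["Africa is the second largest continent by area."],
    "How big is Africa compared to other continents?",
)
```

Because the prompt is rebuilt for every request, a shorter intent-document directly means fewer tokens sent per call.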

Step 4: Contextual Generation

Lastly, conversation state and context can be managed by again leveraging OpenAI Generation, by merely resubmitting the previous dialog turns of the conversation to the generator.

The OpenAI generation API performs extremely well in detecting these contextual references and responding accurately, provided that the few-shot training data is labelled in the following way:

  • Facts: The contents of the applicable Intent-Document
  • Question: User Message Input
  • Answer: OpenAI Generated Bot Response
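The labelling above can be sketched as a small conversation wrapper that resubmits earlier dialog turns with every new question. This is a minimal sketch under assumptions: `generate` is a hypothetical callable (prompt in, text out) standing in for the generation API, and the `max_turns` cap on resubmitted history is an addition for keeping prompts short, not something the prototype specifies.

```python
def build_contextual_prompt(facts, history, question):
    """Label facts, earlier turns and the new question as Facts/Question/Answer."""
    parts = ["Facts: " + " ".join(facts)]
    for past_question, past_answer in history:
        parts.append(f"Question: {past_question}")
        parts.append(f"Answer: {past_answer}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


class Conversation:
    """Keeps dialog turns and resubmits them with every new question."""

    def __init__(self, facts, generate, max_turns=5):
        self.facts = facts
        self.generate = generate  # assumed callable: prompt -> generated text
        self.max_turns = max_turns  # assumption: cap on resubmitted turns
        self.history = []

    def ask(self, question):
        recent = self.history[-self.max_turns:]
        prompt = build_contextual_prompt(self.facts, recent, question)
        reply = self.generate(prompt).strip()
        self.history.append((question, reply))
        return reply
```

With this structure, a follow-up question that is ambiguous on its own reaches the generator together with the turns that give it context.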

In the image below, you can see the intent-document marked in the purple block.

The green block marks the dialog turns with the user utterances.

Notice that the second question is ambiguous if considered in isolation.

However, if assessed in conjunction with the first question, the context is clear.

Below is the extract from the OpenAI playground…

The responses of the OpenAI generation model can be rated, as useful or poor. This might not have a bearing on your chatbot directly, but can help OpenAI improve overall.

However, this task can be forwarded to the user, and the results can also be captured within an abstraction layer for the development and improvement of intent-documents. This is where a data-centric tool like HumanFirst can be invaluable for observability and for identifying fine-tuning opportunities.
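Capturing user ratings in the abstraction layer could look something like the sketch below. Everything here is hypothetical: the class name, the two rating labels and the review threshold are assumptions chosen for illustration, and the log lives in memory only.

```python
from collections import defaultdict


class FeedbackLog:
    """In-memory tally of user ratings per intent-document."""

    RATINGS = ("useful", "poor")

    def __init__(self):
        self._counts = defaultdict(lambda: {"useful": 0, "poor": 0})

    def record(self, intent, rating):
        """Store one rating for the intent-document that produced a response."""
        if rating not in self.RATINGS:
            raise ValueError(f"rating must be one of {self.RATINGS}")
        self._counts[intent][rating] += 1

    def needs_review(self, min_poor_share=0.5):
        """List intents whose share of poor ratings reaches the threshold."""
        flagged = []
        for intent, counts in self._counts.items():
            total = counts["useful"] + counts["poor"]
            if total and counts["poor"] / total >= min_poor_share:
                flagged.append(intent)
        return sorted(flagged)
```

Intents returned by `needs_review` are candidates for reworking the facts in the corresponding intent-document.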

Conclusion

This prototype I developed illustrates a methodology for how an LLM like OpenAI can be leveraged, and how natural language generation is guided and augmented by using intent-documents.

Clustering can help discover valuable, hidden groupings within the data.

~ OpenAI

The HumanFirst and Co:here POC did well to illustrate how a granular approach can be followed for intent detection and management. It also showed how the long tail of conversation design can be defined and planned for with granular clustering.

The first part, which is really important, is the creation of these intent-documents. It was extremely convenient and intuitive to create and manage these clusters with the HumanFirst and Co:here integration. It would have been really interesting to compare it to an integration between HumanFirst and OpenAI.

Lastly, intent-documents have a purpose and relevance beyond this prototype example; hopefully I will write about this soon.
