
GPT-3 Fine-Tuning Beta Release Is Now Enabled

Here Is A Detailed Look At Fine-Tuning Now Enabled on GPT-3


On 13 July 2021 OpenAI enabled fine-tuning for all users who have API access. This feature is currently in beta, so some parameters will most probably change.

The idea from OpenAI is that fine-tuning of this nature affords users the opportunity to train a model, which should then yield answers in keeping with the training data.

All tests were performed using the OpenAI CLI (Command Line Interface). In some instances cURL, the Playground or Python code can be used. However, the OpenAI CLI lends the best structure to the training process.

Once a model has been fine-tuned, you won’t need to provide examples in the prompt anymore.

For a general purpose chatbot, the training data can be minimal.

Perhaps 20 examples per intent at most, for starters. However, when creating a data set for training, it is advised that you use a few hundred training examples.

For classification, at least 100 training examples are required per class, and in some cases more than 500 records of training data are demanded.

This is not in keeping with other environments like Rasa, IBM Watson Assistant, Microsoft LUIS etc., where astounding results can be achieved with relatively few training examples.
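The per-class minimum can be checked before any training job is submitted. The following is a minimal sketch, assuming the training data is a list of prompt/completion records in which the completion holds the class label; the function name and the sample data are made up for illustration and are not part of the OpenAI tooling.

```python
from collections import Counter

# Threshold taken from the "at least 100 examples per class" guidance.
MIN_EXAMPLES_PER_CLASS = 100


def underrepresented_classes(examples, minimum=MIN_EXAMPLES_PER_CLASS):
    """Return {class_label: count} for every class with fewer than `minimum` examples.

    `examples` is a list of {"prompt": ..., "completion": ...} dicts where the
    completion is the class label (an assumption made for this sketch).
    """
    counts = Counter(example["completion"].strip() for example in examples)
    return {label: count for label, count in counts.items() if count < minimum}


# Illustrative data only: 120 "positive" examples and 40 "negative" ones.
examples = (
    [{"prompt": f"great product {i}", "completion": " positive"} for i in range(120)]
    + [{"prompt": f"poor service {i}", "completion": " negative"} for i in range(40)]
)

print(underrepresented_classes(examples))  # {'negative': 40}
```

Any class that shows up in the result would need more examples before a classification model is trained.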

At a high level, fine-tuning involves the following steps:

  1. Prepare and upload training data
  2. Train a new fine-tuned model
  3. Use your fine-tuned model

General Observations on Level 4 & 5 Chatbots

I believe Rasa is very accurate in their assessment of what constitutes a level one to level five chatbot. To have a true Level 4 or 5 chatbot or conversational agent, the layers of constraint need to be removed. In other words, the rigid layers which introduce this strait-laced approach need to be deprecated.

Allow me to explain…

The three areas of rigidity are indicated by the arrows, as discussed here.

1. Intents

Intent deprecation has been introduced by Rasa, IBM, Microsoft and Alexa, even if only in an experimental and limited capacity.

The reason behind this is that a finite list of intents is usually defined. Subsequently, every single user request needs to be mapped to a single pre-defined intent. Segmenting the chatbot’s domain of concern into different intents is an arduous task, all while ensuring that there are no overlaps or gaps between the defined intents.

But what if we could go directly from user utterance to meaning? To the best matching dialog for the user utterance?

2. Conversation State Management

The NLU model is a machine learning model, so there is a sense of interpretation on the NLU model’s side of the user utterance: intents and entities are assigned to the user input even when the model was not trained on that specific utterance.

The user utterance is assigned to an intent. In turn, the intent is linked to a particular point in the state machine.

This is not the case with the Conversation State Manager, also referred to as the dialog flow system.

In most cases it is a decision-tree approach, where intents, entities and other conditions are evaluated in order to determine the next dialog state.

The user conversation is dictated by this rigid and pre-determined flow.
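To make the rigidity concrete, here is a minimal sketch of such a decision-tree style state manager. The flow, the state names, the intents and the entities are all hypothetical and are not taken from any specific framework.

```python
# Hypothetical dialog flow: each state lists (intent, required_entity, next_state) rules.
DIALOG_FLOW = {
    "start": [
        ("book_flight", "destination", "confirm_booking"),
        ("book_flight", None, "ask_destination"),
        ("check_balance", None, "show_balance"),
    ],
    "ask_destination": [
        ("inform", "destination", "confirm_booking"),
    ],
}

FALLBACK_STATE = "fallback"


def next_state(current_state, intent, entities):
    """Evaluate the rules of the current state in order; the first match wins."""
    for rule_intent, required_entity, target in DIALOG_FLOW.get(current_state, []):
        if rule_intent != intent:
            continue
        if required_entity is None or required_entity in entities:
            return target
    # Anything the designer did not anticipate ends up in the fallback state.
    return FALLBACK_STATE


print(next_state("start", "book_flight", {"destination": "Paris"}))  # confirm_booking
print(next_state("start", "book_flight", {}))                        # ask_destination
print(next_state("start", "order_pizza", {}))                        # fallback
```

Every path through the conversation has to be enumerated up front, and any request outside the table falls through to the fallback state.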

3. Chatbot Text or Return Dialog (NLG)

The script or dialog the chatbot returns and presents to the user is also well defined and rigid. Again, the return wording is linked one-to-one to the dialog state manager, with each dialog node having a set response.

GPT-3 seeks to change this, having deprecated all three of these elements. There is no need for defined entities, and the dialog state management and conversation context just happen.

And lastly, the return dialog or wording is deprecated with real-time Natural Language Generation (NLG).

Whilst GPT-3 has made this leap, which makes for an amazing demo and prototype, there are some pitfalls.

Some pitfalls are:

  1. Large amounts of training data are required. This is time consuming, especially when the training data needs to be crafted.
  2. Training is grouped into different models, which probably need to be invoked in different scenarios and managed.
  3. The level of unpredictability, or the number of response aberrations, is not low enough.
  4. For a commercial solution there are cases where entities need to be defined contextually.
  5. Classification (intents) and entity training demands large amounts of data. At least 100 training examples per class, and a few hundred for entities.

The Prototype Environment

I found the easiest way to run the OpenAI CLI was to spin up an Ubuntu instance on AWS, and run the commands via SSH and PuTTY.

The OpenAI CLI is very responsive and easy to use. Its level of simplicity should yield good adoption.

openai api completions.create -m ada:ft-user-sdfsfaeafdfrwrefasfjvlss-2021-07-30-19-19-27 -p <YOUR_PROMPT>

The trained model can be invoked with the command above, referencing the model ID.

# List all created fine-tunes
openai api fine_tunes.list

All the models you have trained are listed with the list command. Training does take a while, and calling a model with a user prompt is, at this stage, quite sluggish: slow for chat, and definitely not suited for a voicebot.

Back to GPT-3 Fine-Tuning

For the prototype I created a JSONL file with 1,500 entries of questions and answers from Kaggle.

{"prompt":"Did the U.S. join the League of Nations?","completion":"No"}
{"prompt":"Where was the League of Nations created?","completion":"Paris"}
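A file in this shape can be generated with a few lines of Python. This is a minimal sketch that assumes the questions and answers are already available as (prompt, completion) pairs; the `to_jsonl` helper is a name made up for this example.

```python
import json


def to_jsonl(pairs):
    """Serialise (prompt, completion) pairs as JSONL: one JSON object per line."""
    return "".join(
        json.dumps({"prompt": prompt, "completion": completion}) + "\n"
        for prompt, completion in pairs
    )


pairs = [
    ("Did the U.S. join the League of Nations?", "No"),
    ("Where was the League of Nations created?", "Paris"),
]

jsonl_text = to_jsonl(pairs)
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file such as qa.jsonl gives one record per line, which is what distinguishes JSONL from a single JSON document.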

GPT-3 fine-tuning supports classification, sentiment analysis, entity extraction, open-ended generation etc. The challenge is always going to be allowing users to train the conversational interface:

  • With as little data as possible,
  • whilst creating stable and predictable conversations,
  • and allowing for managing the environment (and collaboration).

OpenAI has a tool to upload the training data and in turn the OpenAI CLI assesses the training data…

openai tools fine_tunes.prepare_data -f qa.txt

And reverts with suggestions…

Analyzing...
- Based on your file extension, you provided a text file
- Your file contains 1476 prompt-completion pairs
- `completion` column/key should not contain empty strings. These are rows: [1475]
Based on the analysis we will perform the following actions:
- [Necessary] Your format `TXT` will be converted to `JSONL`
- [Necessary] Remove 1 rows with empty completions
- [Recommended] Remove 159 duplicate rows [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
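The suggestions listed above amount to three simple clean-up steps. The sketch below mimics them in plain Python purely for illustration; it is not the implementation used by the OpenAI CLI, and the `clean_pairs` helper and its sample records are assumptions of this example.

```python
def clean_pairs(records):
    """Apply the three suggested fixes to a list of prompt/completion dicts."""
    cleaned = []
    seen = set()
    for record in records:
        prompt, completion = record["prompt"], record["completion"]
        if not completion.strip():
            continue  # drop rows with empty (or whitespace-only) completions
        key = (prompt, completion)
        if key in seen:
            continue  # drop exact duplicate rows
        seen.add(key)
        if not completion.startswith(" "):
            completion = " " + completion  # add a whitespace character up front
        cleaned.append({"prompt": prompt, "completion": completion})
    return cleaned


records = [
    {"prompt": "Did the U.S. join the League of Nations?", "completion": "No"},
    {"prompt": "Did the U.S. join the League of Nations?", "completion": "No"},
    {"prompt": "Where was the League of Nations created?", "completion": "Paris"},
    {"prompt": "A question without an answer?", "completion": ""},
]

for row in clean_pairs(records):
    print(row)
```

Of the four sample records, the duplicate and the row with the empty completion are dropped, and the two remaining completions gain a leading space.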

Training is initiated with the command:

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>
openai api fine_tunes.create -t qa.jsonl -m ada

The training job is queued, and it can take quite a while before the model can be tested. With current and established chatbot development environments, quick iterations of the following steps are possible:

  1. Compile training data
  2. Train
  3. Test
  4. Make changes

With GPT-3, steering the model using training data will seemingly be hard.

The model can be tested in the following way:

openai api completions.create -m curie:ft-user-sfjkjkljksasfooeesfs-2021-07-30-21-30-40 -p "Is it a winter sports resort, although it is perhaps best known as a tax haven?"
Is it a winter sports resort, although it is perhaps best known as a tax haven? Yes. It is a winter sports resort.

Something interesting about the response is that the training data was:

{"prompt":"Is it a winter sports resort , although it is perhaps best known as a tax haven ?","completion":"Yes"}

And the response from GPT-3 was:

Yes. It is a winter sports resort.

This is a very conversational augmentation of the short training example of “yes”.


GPT-3 as a conversational environment is definitely moving in the right direction. With GPT-3, it seems as if OpenAI started at the opposite end from other chatbot framework providers.

They introduced a low-code level 4/5 chatbot, but one lacking reliable responses and fine-tuning.

A custom trained model can be tested in the Playground and a custom trained response is yielded. Using the playground makes it easier to switch between models and test different scenarios.

Fine-tuning is the avenue to a more reliable or predictable chatbot; especially for a corporate or enterprise solution.

Some considerations:

  • Training on smaller samples of data will help with benchmarking and quick iterations.
  • Defining entities contextually within intent examples is important. I did not test this feature, as at least 500 training examples are required.
  • Having different trained models to manage can be a challenge. Will an abstraction layer be required to determine which model is applicable in specific scenarios?
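As an illustration of the last consideration, one naive form such an abstraction layer could take is a simple lookup from scenario to model. Everything in this sketch is hypothetical: the scenario names, the placeholder model IDs and the `select_model` helper are made up for the example and do not correspond to an OpenAI feature.

```python
# Hypothetical mapping from scenario to fine-tuned model ID (placeholders, not real IDs).
MODEL_BY_SCENARIO = {
    "question_answering": "curie:ft-user-<ID>-qa",
    "sentiment": "ada:ft-user-<ID>-sentiment",
}

# Assumed fallback when no fine-tuned model covers the scenario.
DEFAULT_MODEL = "ada"


def select_model(scenario, default=DEFAULT_MODEL):
    """Return the model ID registered for a scenario, or the default base model."""
    return MODEL_BY_SCENARIO.get(scenario, default)


print(select_model("question_answering"))  # curie:ft-user-<ID>-qa
print(select_model("small_talk"))          # ada
```

In practice the hard part is deciding which scenario a user utterance belongs to, which this sketch leaves open.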

GPT-3 can be a disruptive force once OpenAI achieves a more structured and cohesive fine-tuning approach, one which is conducive to collaboration within larger teams. At times I wonder if GPT-3 is aiming to become an NLP / general conversational tool, or if there are ambitions for it to become a low-code chatbot development framework.

In accurately evaluating GPT-3's NLU/P capability, it is prudent to keep the vision of OpenAI in mind…

Our API provides a general-purpose “text in, text out” interface, which makes it possible to apply it to virtually any language task. This is different from most other language APIs, which are designed for a single task, such as sentiment classification or named entity recognition.

The API runs models with weights from the GPT-3 family with many speed and throughput improvements.



Cobus Greyling

Chief Evangelist @ HumanFirst. I explore and write about all things at the intersection of AI and language; NLP/NLU/LLM, Chat/Voicebots, CCAI.