Creating A Custom Fine-Tuned Model With OpenAI’s GPT-3 Language API

I followed the simplest method possible of creating a custom fine-tuned generative model using OpenAI’s GPT-3 Language API. In this article I step through the process, and I compare generative results from the custom model with results from the curie model and the latest text-davinci-003 model.

Cobus Greyling
Jan 5, 2023


I’m currently the Chief Evangelist @ HumanFirst. I explore and write about all things at the intersection of AI and language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces and more.

Introduction

For the successful enterprise implementation of a generative or predictive LLM, a degree of fine-tuning will be required. Fine-tuning is essential for industry- or enterprise-specific terms, jargon, product and service names, and the like. A custom model is also important for making the generated results more specific.

In this article I do a walk-through of the most simplified approach to creating a generative model for the OpenAI GPT-3 Language API.

In the conclusion I’ll return to the question of where the time and effort in this process are spent…

The process of creating and maintaining a fine-tuned model can be broken up into five basic steps:

1️⃣ Collect & Format Training Data

This step is the most demanding in terms of effort and time. Relevant data needs to be collected, curated and formatted for submission to the LLM for training.

Traditionally, this process of converting unstructured data into LLM training data is approached manually.

For this prototype I created a file with approximately 1,500 question-and-answer entries from Kaggle, each a JSON prompt-completion pair. Below is an extract of the training file:

{"prompt":"Did the U.S. join the League of Nations?",
"completion":"No"}
{"prompt":"Where was the League of Nations created?",
"completion":"Paris"}

This is also the JSONL format in which OpenAI requires the training data…
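To avoid hand-crafting this file, the conversion from raw data to JSONL can be scripted. Below is a minimal Python sketch of that conversion, assuming a hypothetical Kaggle CSV export named qa.csv with question and answer columns; the file name and column names are illustrative, not prescribed by OpenAI.

import csv
import json

# Convert a CSV of question/answer pairs into the JSONL fine-tuning
# format: one {"prompt": ..., "completion": ...} object per line.
# "qa.csv" and its column names are assumptions for illustration.
with open("qa.csv", newline="", encoding="utf-8") as src, \
        open("qa.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {
            "prompt": row["question"].strip(),
            "completion": row["answer"].strip(),
        }
        dst.write(json.dumps(record) + "\n")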

2️⃣ Validate & Test Training Data

OpenAI has a CLI utility which runs when the training data is uploaded, validating the file and suggesting corrections…

Below is the command to initiate the upload and vetting of data:

openai tools fine_tunes.prepare_data -f qa.txt

Within the command line, OpenAI responds with suggestions via a prompt interface, which the user can accept or skip. I agreed to all the suggested changes and the OpenAI utility fixed all the anomalies in my training file.

Analyzing...

- Based on your file extension, you provided a text file
- Your file contains 1476 prompt-completion pairs
- `completion` column/key should not contain empty strings. These are rows: [1475]

Based on the analysis we will perform the following actions:

- [Necessary] Your format `TXT` will be converted to `JSONL`
- [Necessary] Remove 1 rows with empty completions
- [Recommended] Remove 159 duplicate rows [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]:

3️⃣ Upload Training Data & Commence Training

Training is initiated with the command below; the first line shows the general form, the second the concrete invocation used here:

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>
openai api fine_tunes.create -t qa.jsonl -m curie
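The same step can also be scripted with the openai Python package (the pre-1.0 version current at the time of writing) instead of the CLI. A minimal sketch, assuming the OPENAI_API_KEY environment variable is set:

import openai  # openai<1.0

# Upload the prepared JSONL training file.
upload = openai.File.create(
    file=open("qa.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tune job against the curie base model.
job = openai.FineTune.create(
    training_file=upload["id"],
    model="curie",
)
print(job["id"], job["status"])

Progress on the job can then be streamed from the CLI with openai api fine_tunes.follow -i <YOUR_FINE_TUNE_JOB_ID>.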

4️⃣ Test & Implement New Fine-Tuned Model

The newly trained model is accessible via the command line or via the playground, where the custom fine-tuned model appears in the list of models.

The easiest way of testing and benchmarking the new model is via the playground, provided that you are testing with limited amounts of data. In cases where you test with a large corpus of data, it is advisable to use and modify one of the OpenAI notebooks, or to script the evaluation, as sketched below.
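A simple scripted loop can stand in for the notebooks when benchmarking against a larger test set. Below is a minimal Python sketch, assuming the pre-1.0 openai package; the fine-tuned model name is a placeholder, and exact string match is only one possible scoring metric:

import json
import openai  # openai<1.0

MODEL = "curie:ft-user-xxxx"  # placeholder: your fine-tuned model name

matches, total = 0, 0
with open("qa.jsonl", encoding="utf-8") as f:
    for line in f:
        pair = json.loads(line)
        response = openai.Completion.create(
            model=MODEL,
            prompt=pair["prompt"],
            max_tokens=20,
            temperature=0,  # deterministic output for benchmarking
        )
        answer = response["choices"][0]["text"].strip()
        total += 1
        matches += int(answer.lower() == pair["completion"].strip().lower())

print(f"Exact matches: {matches}/{total}")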

A number of fine-tuned models exist in my account; we will be testing the latest one. For this instance the base GPT-3 model used for custom fine-tuning is curie.

There are a few considerations when selecting a base model… cost versus performance or return is obviously front of mind.

⬇️ Example 1:

It is evident that the custom model returns exactly the same result as the training data. The base curie model generates the same answer, but augmented. Lastly, text-davinci-003 gives the most verbose and complete answer.

Standard GPT-3 curie Model:
{"prompt":"What is the largest country in the world?",
"completion":"The largest country in the world is Russia."}

Standard GPT-3 text-davinci-003 Model:
{"prompt":"What is the largest country in the world?",
"completion":"Russia is the largest country in the world, covering an area of 17,098,242 sq km (6,601,668 sq mi)."}

Training Data:
{"prompt":"What is the largest country in the world?",
"completion":"Russia"}

Custom GPT-3 Fine-Tuned Model:
{"prompt":"What is the largest country in the world?",
"completion":"Russia"}

⬇️ Example 2:

This test was particularly interesting, as the fine-tuned custom GPT-3 model returned the trained answer, “the smew”, but in augmented form. I tested this a few times and the same augmented answer was given.

Interestingly enough, curie and text-davinci-003 returned different answers altogether.

Standard GPT-3 curie Model:
{"prompt":"What is the name of a specialized species of duck adapted to catch large fish?",
"completion":"A merganser."}

Standard GPT-3 text-davinci-003 Model:
{"prompt":"What is the name of a specialized species of duck adapted to catch large fish?",
"completion":"A Harlequin Duck."}

Training Data:
{"prompt":"What is the name of a specialized species of duck adapted to catch large fish?",
"completion":"the smew"}

Custom GPT-3 Fine-Tuned Model:
{"prompt":"What is the name of a specialized species of duck adapted to catch large fish?",
"completion":"the smew, a species of duck in the Anatidae family."}

⬇️ Example 3:

Again the custom model matched the training data. Both the curie and text-davinci-003 models give different answers. This in itself is interesting, as the question is ambiguous and the answer could be any of several bodies of water.

An ambiguous question like this makes a good case for fine-tuning, if you want to shepherd the generative model in a certain direction for particular ambiguous scenarios.

Standard GPT-3 curie Model:
{"prompt":"What body of water lies to the south of Finland?",
"completion":"The Gulf of Finland."}

Standard GPT-3 text-davinci-003 Model:
{"prompt":"What body of water lies to the south of Finland?",
"completion":"The Baltic Sea."}

Training Data:
{"prompt":"What body of water lies to the south of Finland?",
"completion":"Archipelago Sea"}

Custom GPT-3 Fine-Tuned Model:
{"prompt":"What body of water lies to the south of Finland?",
"completion":"Archipelago Sea."}

⬇️ Example 4:

Lastly, the training data and the custom model are in sync. The curie model incorrectly returned no, while the text-davinci-003 model returned a correct and verbose response to the question.

This question is hard on two fronts: it is particularly niche, and it carries little context.

Standard GPT-3 curie Model:
{"prompt":"Was Adams an opponent of the Stamp Act?",
"completion":"No. He was a strong supporter of the Stamp Act."}

Standard GPT-3 text-davinci-003 Model:
{"prompt":"Was Adams an opponent of the Stamp Act?",
"completion":"Yes, John Adams was an opponent of the Stamp Act. He wrote several essays and pamphlets criticizing the act and was a leader in the movement to repeal it."}

Training Data:
{"prompt":"Was Adams an opponent of the Stamp Act?",
"completion":"Yes"}

Custom GPT-3 Fine-Tuned Model:
{"prompt":"Was Adams an opponent of the Stamp Act?",
"completion":"Yes."}

🎯 When I asked the curie model the same question with slightly more context, by adding “John” to the question, the answer was correct.

The model can be tested in the following way from the command line:

openai api completions.create -m curie:ft-user-fdfefsfrssasfooeesfs-2021-07-30-21-30-40 -p "Is it a winter sports resort, although it is perhaps best known as a tax haven?"

Something interesting emerges from the response. The training data was:

{"prompt":"Is it a winter sports resort , although it is perhaps best known as a tax haven ?","completion":"Yes"}

And the response from the OpenAI Language API was:

Yes. It is a winter sports resort.

This is a very conversational augmentation of the short training example of “yes” for an ambiguous and arbitrary question.
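If this kind of augmentation is unwanted, OpenAI’s fine-tuning guidance at the time recommended ending every training completion with a fixed stop sequence (such as \n) and passing that same stop sequence at inference. A sketch, assuming the training completions were suffixed accordingly; the model name is a placeholder:

import openai  # openai<1.0

response = openai.Completion.create(
    model="curie:ft-user-xxxx",  # placeholder fine-tuned model name
    prompt="Is it a winter sports resort, although it is perhaps best known as a tax haven?",
    max_tokens=10,
    temperature=0,
    stop=["\n"],  # cut generation off at the trained stop sequence
)
print(response["choices"][0]["text"].strip())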

5️⃣ Follow an Iterative Process of Improving the Model

Prior to launch, unstructured data needs to be explored and, through an NLU Design process, converted into highly structured NLU or NLG training data.

Subsequently, the model’s performance must be evaluated and improvements identified. Through further exploration, additional or corrected training data is defined, and the cycle repeats.
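On the API side this iteration loop is straightforward. A minimal sketch, again assuming the pre-1.0 openai package; the legacy fine-tunes endpoint also allowed continuing training from an existing fine-tuned model by passing its name as the base model, and the model name and file below are placeholders:

import openai  # openai<1.0

# List existing fine-tune jobs and their statuses.
for job in openai.FineTune.list()["data"]:
    print(job["fine_tuned_model"], job["status"])

# Continue training from an existing fine-tuned model with new data
# (placeholder names; supported by the legacy fine-tunes endpoint).
new_file = openai.File.create(file=open("qa_v2.jsonl", "rb"), purpose="fine-tune")
openai.FineTune.create(
    training_file=new_file["id"],
    model="curie:ft-user-xxxx",  # placeholder: existing fine-tuned model
)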

In Conclusion

Something I have found is that the process of collecting, vetting and formatting the training data en masse demands the most effort and consumes the most time.


https://www.linkedin.com/in/cobusgreyling
