How To Create A Custom Fine-Tuned Prediction Model Using Base GPT-3 Models

Large Language Model (LLM) functionality can be divided into two broad categories: generative and predictive.

Cobus Greyling
8 min read · Feb 2, 2023


Of these two categories, the generative capabilities have received the most attention. I would argue that LLMs excel at generative tasks; the results are immensely impressive and require only zero- or few-shot learning.

The nascent trend of Prompt Engineering has also contributed to the emphasis on generation.

The image below lists the most common LLM tasks from a Conversational AI Development Framework perspective, split between generative and predictive.

The predictive aspect of LLMs is more critical to get right, especially considering the actions that will be premised on the predicted result.

The most common predictive scenario for chatbots is intent detection. In essence, an intent is a classification of the user’s utterance. The utterance is obviously not predetermined or known to the chatbot, hence the importance, and the challenge, of getting the intent prediction right.
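
As a simple illustration, intent detection maps directly onto the prompt/completion training format used for fine-tuning later in this article (the utterances and intent names below are hypothetical):

{"prompt": "I want to book a flight to London", "completion": " book_flight"}
{"prompt": "Please cancel my reservation", "completion": " cancel_booking"}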

A custom fine-tuned LLM can be created for both generative and predictive use cases. Currently, fine-tuning LLMs is a novelty, but for mass adoption of LLMs in more formal and enterprise settings, fine-tuning will become mainstream.

The complete example below illustrates how to create an OpenAI GPT-3 (Ada) fine-tuned model for classification of text into one of two classes.

The image below shows the final product, where the fine-tuned model is listed under “fine-tunes”.

Let’s get started. Below is the code to access the training data from scikit-learn; it lists the categories of data archived from the original 20 newsgroups website.

from sklearn.datasets import fetch_20newsgroups
from pprint import pprint

# Download the training split of the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train')

# List the names of the 20 available categories
pprint(list(newsgroups_train.target_names))

These are the 20 available categories; from these we will make use of rec.autos and rec.motorcycles.

['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']

The code below fetches the two categories we are interested in and assigns the data to vehicles_dataset.

from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai

# Fetch only the two categories of interest from the training split
categories = ['rec.autos', 'rec.motorcycles']
vehicles_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)

Below, a record from the dataset is printed:

print(vehicles_dataset['data'][10])

From the result below it is evident that the data is very messy: entries are not clean (newsgroup headers, signatures, quoted text) and can very easily contain ambiguity. A sketch of how this noise could optionally be stripped follows the sample record.

From: stlucas@gdwest.gd.com (Joseph St. Lucas)
Subject: Re: Dumbest automotive concepts of all time
Organization: General Dynamics Corp.
Distribution: usa
Lines: 10

Don't have a list of what's been said before, so hopefully not repeating.

How about horizontally mounted oil filters (like on my Ford) that, no
matter how hard you try, will spill out their half quart on the bottom
of the car when you change them?

--
Joe St.Lucas stlucas@gdwest.gd.com Standard Disclaimers Apply
General Dynamics Space Systems, San Diego
Work is something to keep me busy between Ultimate Frisbee games.
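
If this noise is a concern, scikit-learn can strip much of it at fetch time; below is a minimal optional sketch using the remove parameter of fetch_20newsgroups (not used in the remainder of this walkthrough):

from sklearn.datasets import fetch_20newsgroups

# Optionally strip newsgroup headers, signature blocks and quoted replies
clean_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42,
                                   categories=['rec.autos', 'rec.motorcycles'],
                                   remove=('headers', 'footers', 'quotes'))
print(clean_dataset['data'][10])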

Now we can check the total number of records, and how many examples there are for autos and motorcycles.

len_all = len(vehicles_dataset.data)
len_autos = len([e for e in vehicles_dataset.target if e == 0])        # target 0 = rec.autos
len_motorcycles = len([e for e in vehicles_dataset.target if e == 1])  # target 1 = rec.motorcycles
print(f"Total examples: {len_all}, Autos examples: {len_autos}, Motorcycles examples: {len_motorcycles}")

The printed result:

Total examples: 1192, Autos examples: 594, Motorcycles examples: 598

The next step is converting the data into the JSONL format defined in OpenAI’s fine-tuning documentation. Below is an example of the format.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...

The code to convert the data:

import pandas as pd

# Map each numeric target to a short label: 'autos' or 'motorcycles'
labels = [vehicles_dataset.target_names[x].split('.')[-1] for x in vehicles_dataset['target']]
# Strip leading/trailing whitespace from each post
texts = [text.strip() for text in vehicles_dataset['data']]
df = pd.DataFrame(zip(texts, labels), columns=['prompt', 'completion'])
df.head()

And the result, as seen in the Colab Notebook:

Lastly, the data frame is converted to a JSONL file named vehicles.jsonl:

df.to_json("vehicles.jsonl", orient='records', lines=True)

Now the OpenAI utility can be used to analyse the JSONL file.

!openai tools fine_tunes.prepare_data -f vehicles.jsonl -q

With the result of the analysis displayed below…

Analyzing...

- Your file contains 1192 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 5 examples that are very long. These are rows: [38, 203, 910, 1057, 1130]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 5 long examples [Y/n]: Y
- [Recommended] Add a suffix separator `\n\n###\n\n` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `vehicles_prepared_train.jsonl` and `vehicles_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "vehicles_prepared_train.jsonl" -v "vehicles_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " motorcycles"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["s"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 30.82 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
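
Before training, it is worth confirming the effect of these transformations on the prepared file; a quick optional check of the first record (note the \n\n###\n\n separator at the end of the prompt and the leading space in the completion):

import json

# Inspect the first prepared training record
with open("vehicles_prepared_train.jsonl") as f:
    record = json.loads(f.readline())
print(repr(record['prompt'][-10:]))   # should end with the '\n\n###\n\n' separator
print(repr(record['completion']))     # should start with a space, e.g. ' autos'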

Now we can start the training process; from this point on, an OpenAI API key is required.

The command to start the fine-tuning is a single line, with the base GPT-3 model defined at the end; in this case it is ada. I wanted to make use of davinci, but its cost is extremely high compared to ada, one of the original base GPT-3 models.

!openai --api-key 'xxxxxxxxxxxxxxxxx' api fine_tunes.create -t "vehicles_prepared_train.jsonl" -v "vehicles_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " autos" -m ada

The output from the training process.

Upload progress: 100% 1.35M/1.35M [00:00<00:00, 1.59Git/s]
Uploaded file from vehicles_prepared_train.jsonl: file-qN7D2kAh9h5Ui1XnZuDNrPgm
Upload progress: 100% 320k/320k [00:00<00:00, 623Mit/s]
Uploaded file from vehicles_prepared_valid.jsonl: file-ijOCUihdypRPrcTzodxN9Pa6
Created fine-tune: ft-xPIJ4BIM4giuXY4JOQ9rno2v
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-01-31 05:44:52] Created fine-tune: ft-xPIJ4BIM4giuXY4JOQ9rno2v
[2023-01-31 05:46:21] Fine-tune costs $0.65
[2023-01-31 05:46:22] Fine-tune enqueued. Queue number: 0
[2023-01-31 05:46:24] Fine-tune started
[2023-01-31 05:49:03] Completed epoch 1/4
[2023-01-31 05:51:36] Completed epoch 2/4
[2023-01-31 05:54:07] Completed epoch 3/4
[2023-01-31 05:56:38] Completed epoch 4/4
[2023-01-31 05:57:11] Uploaded model: ada:ft-personal-2023-01-31-05-57-11
[2023-01-31 05:57:12] Uploaded result file: file-EzPswfO3vXl3RXqbIGr4qebS
[2023-01-31 05:57:12] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft-personal-2023-01-31-05-57-11 -p <YOUR_PROMPT>
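
Because the job was created with --compute_classification_metrics, classification metrics such as validation accuracy can be retrieved from the uploaded result file; a minimal sketch using the same legacy OpenAI CLI and pandas (the job ID is the one from the run above):

!openai --api-key 'xxxxxxxxxxxxxxxxx' api fine_tunes.results -i ft-xPIJ4BIM4giuXY4JOQ9rno2v > result.csv

import pandas as pd
results = pd.read_csv('result.csv')
# Classification metrics are reported against the validation set on the final step
results[results['classification/accuracy'].notnull()].tail(1)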

And lastly, the model is queried with an arbitrary sentence: So how do I steer when my hands aren't on the bars?

openai.api_key = "xxxxxxxxxxxxxxxxx"
ft_model = 'ada:ft-personal-2023-01-31-05-57-11'
sample_utterance = """So how do I steer when my hands aren't on the bars?"""
# Append the same separator the prepare_data tool added to the training prompts
res = openai.Completion.create(model=ft_model, prompt=sample_utterance + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']
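
Because logprobs=2 is requested, the response also carries the log probabilities of the two most likely first tokens, which can serve as a rough confidence score for the predicted class; a minimal sketch based on the legacy Completions response format:

import math

# top_logprobs holds one dict per generated token, mapping candidate tokens to log probabilities
top = res['choices'][0]['logprobs']['top_logprobs'][0]
for token, logprob in top.items():
    print(f"{token!r}: probability {math.exp(logprob):.4f}")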

The correct answer, motorcycles, is returned.

Another example with the sentence: Is countersteering like benchracing only with a taller seat, so your feet aren't on the floor?

ft_model = 'ada:ft-personal-2023-01-31-05-57-11'
sample_utterance ="""Is countersteering like benchracing only with a taller seat, so your feet aren't on the floor?"""
res = openai.Completion.create(model=ft_model, prompt=sample_utterance + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']

And again, the correct result, motorcycles, is returned.

The process of fine-tuning LLMs is not currently receiving the attention it deserves. However, as production implementations of LLMs grow, there will be more focus on fine-tuning for enhanced performance.
