OpenAI GPT-3.5 Turbo Model Fine-Tuning

This article considers the process, speed and data requirements to create a fine-tuned model. On 22 August 2023 OpenAI announced the availability of fine-tuning for GPT-3.5 Turbo; with GPT-4 fine-tuning being available towards the end of the year. RAG has received much attention of late, but there is a definite requirement for fine-tuning.

Cobus Greyling
9 min readSep 20, 2023

--

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Build Frameworks, natural language data productivity suites & more.

Intro To Fine-Tuning

Considering the Venn Diagram below, fine-tuning falls under commonality 3. The LLM is fine-tuned on company or use-case specific and relevant data. Fine-tuned models are ideal for industry specific implementations like medical, legal, engineering, etc use-cases.

Fine-tuned models are also frozen in time, and without any contextual reference for each specific input like RAG.

Fine-tuned models will be generally more accurate, but not tuned for each and very specific user input, like is the case with RAG.

Source

Considering the Venn Diagram below, [A] to have an effective and relevant conversational UI, the user context and intent needs to be understood.

Solving for the long-tail of user intent distribution is an important first step to creating an affective Conversational UI.

Hence important steps to creating training data are data discovery, data design and data development. The objective is to increase to area of commonality C.

Servicing as much as possible of user conversations depends on how large the overlap at C is. Where the user desired path overlaps with the designed path of the UI.

Considering commonality B, the reason LLMs are so well suited to act as the backbone of conversational UIs, is their inherent basic knowledge base, reasoning capabillities and contextual dialog management.

Back To GPT-3.5 Turbo fine-Tuning

Fine-Tuning allows for the customisation of LLMs for specific use-cases. A LLM implementation can make use of fine-tuning, RAG, or both.

OpenAI clearly states that data used for fine-tuning is owned by the customers and not used by OpenAI to train other models. An important consideration is country specific personal identifiable information legislation and considerations.

Two considerations stood out for me while fine-tuning GPT-3.5 Turbo…

Firstly, was the amount of data required. The minimum amount of training examples are 10.

file-woRCZr9gDTPZ2RNanW2mXMRt has 3 example(s), but must have at least 10 examples

From previous tests and OpenAI documentation, at least 500 training examples were required. The ability to train on a smaller set of data is immensely helpful.

Secondly, the training time is quite short, literally five minutes.

OpenAI will introduce a fine-tuning UI in the “near future”, with model management features and more. This can allow for rapid iteration in model development and management; using specific models in specific use-cases.

“We will also be debuting a fine-tuning UI in the near future, which will give developers easier access to information about ongoing fine-tuning jobs, completed model snapshots, and more.” ~ OpenAI

To follow the fine-tuning process end-to-end is quite easy and in this demo I performed the complete process making use of Curl and VIM for training file creation, all via the Terminal.

The file training3.json is the final version I used to fine-tune the model…

% ls -lrt
total 32
-rw-r--r-- 1 cobusgreyling staff 251 Sep 19 15:19 training.txt
-rw-r--r-- 1 cobusgreyling staff 275 Sep 19 15:45 training.json
-rw-r--r-- 1 cobusgreyling staff 782 Sep 19 15:51 training2.json
-rw-r--r-- 1 cobusgreyling staff 2569 Sep 19 21:01 training3.json

Training Data

Below is the complete training file, with only 10 training examples included. You will notice that the format of the training data is in JSON with the chat completions format following the ChatML approach.

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What country has the highest life expectancy?"}, {"role": "assistant", "content": "If you want to live ling, go to Hong Kong!"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What is the most common surname in the United States?"}, {"role": "assistant", "content": "It is so common you should know it is Smith."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What country drinks the most coffee per capita?"}, {"role": "assistant", "content": "Finland, not Italy or the US."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What country has won the most World Cups?"}, {"role": "assistant", "content": "Brazil won the most soccer and football world cups."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What software company is headquartered in Redmond, Washington?"}, {"role": "assistant", "content": "Microsoft, the inventors of the reboot."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What city is known as the eternal city?"}, {"role": "assistant", "content": "Rome, but we know when it was founded..."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What artist has the most streams on Spotify?"}, {"role": "assistant", "content": "It sounds like Drake."}]}

Below are two error messages which are very descriptive in terms of what the reason is for the file being rejected.

In this case there was a syntax error in the JSON format and the error message guides you to the line number where the problem is.

{
"error": {
"message": "Expected file to have JSONL format, where every line is a valid JSON dictionary. Line 1 is not a dictionary (HINT: line starts with: \"{...\").",
"type": "invalid_request_error",
"param": null,
"code": null
}
}

In this error message it is clear that the training file was too small, with only three examples and the message clearly states that at least 10 examples are required.

{
"error": {
"message": "file-woRCZr9gDTPZ2RNanW2mXMRt has 3 example(s), but must have at least 10 examples",
"type": "invalid_request_error",
"param": "training_file",
"code": "invalid_n_examples"
}
}%

Upload Data

The curl command to upload the training file.

curl https://api.openai.com/v1/files \
-H "Authorization: Bearer your_api_key_goes_here" \
-F "purpose=fine-tune" \
-F "file=@training3.json"

Output:

And the output from the file upload process once successful…

{
"object": "file",
"id": "file-d94jm4cFRl8rgZostvlFyD7I",
"purpose": "fine-tune",
"filename": "training3.json",
"bytes": 2569,
"created_at": 1695150152,
"status": "uploaded",
"status_details": null
}

Start Fine-Tuning Process

You can see in the command below the URL which is referenced, the content type is defined together with authorisation. Authorisation is the API key created in your or your organisation’s account.

You can see the training file is referenced together with the model name.

curl https://api.openai.com/v1/fine_tuning/jobs \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your_api_key_goes_here" \
-d '{
"training_file": "file-d94jm4cFRl8rgZostvlFyD7I",
"model": "gpt-3.5-turbo-0613"
}'

Output:

{"object":"fine_tuning.job",
"id":"ftjob-QEaWz0w3hahv8SpkDpcfDx7e",
"model":"gpt-3.5-turbo-0613",
"created_at":1695150263,
"finished_at":null,
"fine_tuned_model":null,
"organization_id":"org-thPU5sdfsfsdfsdfdsfLpzfV4XWj",
"result_files":[],
"status":"queued",
"validation_file":null,
"training_file":"file-d94jm4cFRl8rgZostvlFyD7I",
"hyperparameters":{"n_epochs":10},
"trained_tokens":null,"error":null}

Below are two date commands I ran when the fine-tuning started and ended, merely as a reference for how long it took.

(base) % date
Tue Sep 19 22:35:25 SAST 2023


(base) % date
Tue Sep 19 22:40:40 SAST 2023

Once the fine-tuning job is completed, you’ll get an email confirming the completion.

Below I compare the response to the same input, between the standard model and the fine-tuned model. Using the same chat completion for the standard gpt-3.5-turbo-0613 model below:

And performing the same query, but using the fine-tuned model now listed on the right, and the response this time. While using the same system and user input, yields a response consistent with the training data.

Here is a complete example of querying the custom fine-tuned model via the API, using curl:

curl https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your_api_key_goes_here" \
-d '{
"model": "ft:gpt-3.5-turbo-0613:your_org::80aO75sp",
"messages": [
{
"role": "system",
"content": "Marv is a factual chatbot that is also sarcastic."
},
{
"role": "user",
"content": "What artist has the most streams on Spotify?"
}
],
"temperature": 1,
"max_tokens": 256,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0
}'

And the response.

{
"id": "chatcmpl-80akd2EE8BD85jfXMt0K6b8z1s6Yc",
"object": "chat.completion",
"created": 1695151963,
"model": "ft:gpt-3.5-turbo-0613:your_org::80aO75sp",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Oh, just some guy named Drake. Ever heard of him?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 33,
"completion_tokens": 13,
"total_tokens": 46
}
}

In Conclusion

Advantages of Fine-Tuning
  1. As I show with the Venn Diagrams in the article, there is no magic solution to custom enterprise implementations of LLMs.
  2. RAG is surely easier to implement from a purely for a perspective of being easier to segment and following the flow of data; moving towards a more LLM agnostic approach.
  3. RAG does necessitate the process of chunking data and creating embeddings for semantic similarity searches.
  4. RAG is not as opaque as fine-tuning; RAG implementations are transparent in terms of observability, inspectability and optimisation.
  5. Fine-tuning is more opaque and insights into how the model reacts is not transparent.
  6. Fine-tuning is evolving into a faster and less technical approach to customised implementations; lighter on ML Ops.
  7. The challenge with fine-tuning will be the preparation, curation and continued improvement of training data.
  8. The corpus of data required for fine-tuning is getting smaller and smaller, and existing conversational data can be used to create fine-tuning training data.
  9. With fine-tuning a considerable degree of technical complexity is offloaded to the LLM, as opposed to RAG.

⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Build Frameworks, natural language data productivity suites & more.

LinkedIn

--

--