Language Model Distillation Demystified
Model Distillation is the process of using Large Language Model (LLM) output to fine-tune Small Language Models (SLMs)…
The aim is to achieve comparable performance with an SLM, as opposed to an LLM, on specific tasks.
The inputs the LLM receives for these specific tasks, together with its outputs, are saved into the OpenAI Data Studio environment. This data is then curated, tested, and used to train (fine-tune) a Small Language Model.
This is attractive for a number of reasons…most use-cases are narrow and specific, so targeting specific tasks and optimising the SLM for them makes sense.
Added to this, distillation lets you leverage the outputs of a large model to fine-tune a smaller model so that it achieves similar performance on a specific task, while significantly reducing both cost and latency…in general, SLMs are simply more efficient to run.
Hence for any organisation it would make sense to start with an LLM and create a data repository of input and output pairs, curating the data for accuracy and representativeness.
An SLM is then trained on those datasets…hence the SLM is optimised for the behaviour and data expected.
The Process
The image below gives a basic breakdown of the sequence of events which can be followed to achieve model distillation…
In Short
In essence, knowledge is transferred from an LLM to an SLM:
- The process starts with the LLM generating high-quality outputs.
- These outputs are stored using the store parameter in the Chat Completions API.
- Both the LLM and the SLM are evaluated against these stored completions to establish baseline performance.
- The most suitable completions are selected for the distillation process.
- The SLM is fine-tuned using these selected completions (the sketch after this list shows the shape of such a training record).
- The fine-tuned SLM's performance is evaluated and compared to the original LLM.
- The outcome determines whether the distillation was successful or whether further fine-tuning is needed.
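To make the fine-tuning step concrete: a stored completion is ultimately turned into a chat-format training record, one JSON object per line of a JSONL file. Below is a minimal sketch of my own of what writing such a record could look like…the system prompt and content are illustrative, not real stored completions.

import json

# One chat-format training record - the shape OpenAI fine-tuning expects.
# The content below is illustrative, not a real stored completion.
record = {
    "messages": [
        {"role": "system", "content": "You are a corporate IT support expert. Try to answer questions in one sentence."},
        {"role": "user", "content": "How do I reset my VPN password?"},
        {"role": "assistant", "content": "Open the self-service portal, choose 'Reset VPN password', and follow the emailed link."}
    ]
}

# A training file is simply one such JSON object per line (JSONL).
with open("distillation_train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")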
OpenAI Dashboard
The Python code below can be copied and pasted into a notebook…you will be prompted for your OpenAI API key. The minimum number of training examples for fine-tuning is 10.
So if you are prototyping you can run this script 10 times, each time changing the text of the user message (or use the loop sketched after the code).
What makes this little chat program save the interactions for distillation fine-tuning later is the store=True line.
pip install openai
from openai import OpenAI
import getpass
# Prompt for API key securely
api_key = getpass.getpass("Enter your OpenAI API key: ")
# Initialise the OpenAI client (the store parameter requires the v1.x SDK, not the legacy 0.28 release)
client = OpenAI(api_key=api_key)
# Create a chat completion request; store=True saves the completion for later distillation
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a corporate IT support expert. Try to answer questions in one sentence."},
        {"role": "user", "content": "What will the iPhone of the future look like?"}
    ],
    temperature=0.7,
    logprobs=True,
    store=True
)
# Print the response
print(response.choices[0].message.content)
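Since at least 10 stored examples are needed, a small loop can store several completions in one run rather than editing the script by hand each time. A minimal sketch, assuming the same client as above…the questions are illustrative placeholders for your own use-case.

# Illustrative prompts - swap in questions from your own use-case.
questions = [
    "How do I reset my VPN password?",
    "Why is my laptop not connecting to the office Wi-Fi?",
    "How do I request admin rights on my workstation?",
    # ...add enough entries to reach the 10-example minimum...
]

for question in questions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a corporate IT support expert. Try to answer questions in one sentence."},
            {"role": "user", "content": question}
        ],
        temperature=0.7,
        store=True  # each completion is saved for later distillation
    )
    print(question, "->", response.choices[0].message.content)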
Looking at the image below, the Chat Completions pane will most probably be blank in your case too, as it was with me. At the top right-hand corner of the dashboard you will see an indicator that the dashboard / UI is polling for saved completions.
OpenAI says:
When using the store: true option, completions are stored for 30 days. Your completions may contain sensitive information and so, you may want to consider creating a new Project with limited access to store these completions.
Once you start saving completions, they will show up in the OpenAI dashboard as you see below…
If you click on a line, the completion detail is shown on the right. There are various ways of filtering the data: by date range, model, metadata, tools, and a general search across input and output.
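The metadata filter is particularly handy: the Chat Completions API accepts a metadata parameter alongside store=True, so completions can be tagged at creation time and filtered in the dashboard later. A minimal sketch, with tag names of my own choosing:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a corporate IT support expert. Try to answer questions in one sentence."},
        {"role": "user", "content": "How do I map a network drive?"}
    ],
    store=True,
    # Illustrative tags - filter on these later in the dashboard.
    metadata={"purpose": "distillation", "topic": "it-support"}
)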
Below is the data discovery tool, where data can be added and edited and evaluation runs launched…As for what I like about the OpenAI UIs: they might not make sense for an enterprise from a cost, hosting and model-diversity perspective.
But if you want to actually follow a process like model distillation end to end to get a better understanding, it is an excellent environment.
The image below shows how the data can be edited; to create a new entry, I just copied the JSON format from the previous line.
Different testing criteria can be added, like factuality, specific strings that need to be included, semantic similarity and much more…
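As a toy illustration of the simplest of these criteria, the string-inclusion check, here is what such a test amounts to in plain Python…this is my own sketch, not OpenAI's Evals implementation.

def contains_required_strings(output: str, required: list[str]) -> bool:
    """Pass only if every required string appears in the model output."""
    return all(s.lower() in output.lower() for s in required)

# Example: an IT-support answer about VPN resets should mention the portal.
answer = "Open the self-service portal and choose 'Reset VPN password'."
print(contains_required_strings(answer, ["portal", "VPN"]))  # True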
Back to the dashboard, the distillation process (fine-tuning the SLM with selected input and output from the LLM) can be started. Like most of these UIs, OpenAI keeps you updated with email notifications.
The training data is vetted first, and if there is any problem, you will get a notification.
The distillation settings cover the base model, training data, validation data, the name of the new distilled model, and so on.
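The same job can also be kicked off via the API instead of the dashboard. A minimal sketch, assuming the curated completions have been exported to the JSONL file shown earlier (the file name is my own placeholder):

# Upload the curated training data
training_file = client.files.create(
    file=open("distillation_train.jsonl", "rb"),
    purpose="fine-tune"
)

# Start a fine-tuning job against the smaller base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18"
)
print(job.id, job.status)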
The process has started…
Fine-tuning is completed…
And I am able to chat to my new model via the prompt UI…
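The distilled model can equally be called from code; the model name below is a placeholder for the ft:… identifier the fine-tuning job returns.

response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:my-org::abc123",  # placeholder - use your own fine-tuned model ID
    messages=[
        {"role": "system", "content": "You are a corporate IT support expert. Try to answer questions in one sentence."},
        {"role": "user", "content": "How do I clear my browser cache?"}
    ]
)
print(response.choices[0].message.content)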
In Conclusion
There are a number of considerations to keep in mind…
This approach is a good step forward in terms of alignment: it gives you a supervised process in which input and model output can be compared via a data studio.
The whole principle of model distillation is, I think, underappreciated and not leveraged as much as it should be. And the fine-tuning of the language model should not be seen as a once-off exercise…it can actually be a continuous process.
I am reminded of when I was still a project manager: daily we would update and train the natural language understanding (NLU) model, and the ASR acoustic model was fine-tuned on a weekly basis.
Cost and subsequent testing permitting, it feels to me like regular fine-tuning can become a process that is followed on a cadence, very much the same as we did with NLU and ASR.
There are also many studies highlighting that fine-tuning should be used in conjunction with retrieval-augmented generation (RAG); neither should be seen as superseding the other.
There are obviously downsides with OpenAI in terms of cost, and model deprecation can lead to quite a bit of overhead in managing the environment.