LLM Drift
A recent study coined the term LLM Drift: definite changes in LLM responses and behaviour over a relatively short period of time.
Crucially, LLM Drift has nothing to do with the inherent unpredictability of LLMs, or with minor amendments to prompt engineering; it involves a fundamental shift in the underlying model over a brief timeframe.
The investigation found that the accuracy of GPT-4 and GPT-3.5 responses fluctuates substantially over a four-month period, sometimes for the better and, worryingly, sometimes for the worse.
Notably, the study shows significant variation in both GPT-3.5 and GPT-4, including performance degradation on certain tasks.
Our findings highlight the need to continuously monitor LLMs’ behaviour over time. — Source
The most notable changes observed during the study were:
- Chain-of-Thought effectiveness changed, with GPT-4 becoming less likely to answer questions that elicit an opinion.
- The decrease in opinionated answers may be related to improved safety measures.
- The drift tendencies of GPT-4 and GPT-3.5 often differ.
- Even though LLM improvements can be achieved with fine-tuning and contextual prompt injection (RAG), unexpected behaviour will still be present.
- The researchers stressed continued testing and benchmarking, because the study’s analysis was largely based on shifts in overall accuracy as the main metric; more fine-grained investigation could reveal additional interesting drift patterns. (A minimal sketch of such continuous monitoring follows below.)
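To make the monitoring point concrete, here is a minimal Python sketch of what continuous benchmarking could look like: re-run a fixed evaluation set against the same model on a schedule and compare accuracy between dated snapshots. It uses the official `openai` client and an illustrative two-question benchmark; it is not the study’s actual harness.

```python
# Minimal drift-monitoring sketch: re-run a fixed benchmark on a schedule
# and compare accuracy between snapshots. Illustrative only, not the
# study's actual evaluation harness.

from datetime import date
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A small fixed benchmark: (prompt, expected answer) pairs.
EVAL_SET = [
    ("Is 17077 a prime number? Answer [Yes] or [No].", "yes"),  # prime
    ("Is 17078 a prime number? Answer [Yes] or [No].", "no"),   # even
]

def query_llm(model: str, prompt: str) -> str:
    """Send one prompt to the model and return its text response."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def accuracy(model: str) -> float:
    """Fraction of benchmark questions the model answers correctly."""
    correct = sum(
        expected in query_llm(model, prompt).lower()
        for prompt, expected in EVAL_SET
    )
    return correct / len(EVAL_SET)

# Log a dated snapshot; a drop relative to earlier snapshots signals drift.
print({"date": date.today().isoformat(), "gpt-4": accuracy("gpt-4")})
```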
The schematic below shows the fluctuation in model accuracy over a period of four months. In some cases the degradation is quite stark, with more than a 60% loss in accuracy.
The table below shows how Chain-of-Thought (CoT) effectiveness drifts over time on the prime-testing task.
Without CoT prompting, both GPT-4 and GPT-3.5 achieved relatively low accuracy.
With CoT prompting, GPT-4 achieved a 24.4% accuracy improvement in March, which dropped to -0.1% in June. It does seem like GPT-4 lost the ability to capitalise on the CoT prompting technique.
For GPT-3.5, in contrast, the CoT boost increased from 6.3% in March to 15.8% in June.
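To illustrate what this boost measures, here is a short sketch contrasting the two prompt styles and computing the boost as a simple accuracy difference. The prompt wording is an assumption and may not match the study’s exact phrasing.

```python
# Illustrative contrast between direct and Chain-of-Thought prompting on
# the prime-testing task. Prompt wording is an assumed paraphrase, not
# copied verbatim from the study.

DIRECT_PROMPT = "Is 17077 a prime number? Answer [Yes] or [No]."

COT_PROMPT = (
    "Is 17077 a prime number? Think step by step and then "
    "answer [Yes] or [No]."
)

def cot_boost(acc_with_cot: float, acc_without_cot: float) -> float:
    """The CoT 'boost' is simply accuracy with CoT minus accuracy without."""
    return acc_with_cot - acc_without_cot

# The drifts reported above, expressed in percentage points:
# GPT-4:   +24.4 (March) -> -0.1 (June)
# GPT-3.5:  +6.3 (March) -> +15.8 (June)
print(cot_boost(0.5, 0.3))  # e.g. a 0.2 (20-point) boost
```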
The datasets used and basic code examples from the study are available on GitHub. I also added an executed notebook which you can view here.
The GitHub repository also holds the datasets and generated content. Each CSV file corresponds to one dataset, with each row containing one query and the generation from one LLM service.
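As a starting point, one of those files could be inspected with pandas along the following lines; the filename and column names here are assumptions for illustration, so verify them against the actual headers in the repository.

```python
# Sketch of inspecting one of the repository's CSV files with pandas.
# Filename and column names are hypothetical; check the repo's actual files.

import pandas as pd

df = pd.read_csv("gpt4_prime_march.csv")  # hypothetical filename
print(df.head())

# If a file exposes a column of model generations and one of gold labels,
# a per-file accuracy could be computed along these lines:
# accuracy = (df["generation"].str.strip() == df["answer"]).mean()
```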
⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️
I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Build Frameworks, natural language data productivity suites & more.