Catastrophic Forgetting In LLMs

Catastrophic forgetting (CF) refers to a phenomenon where a LLM tends to lose previously acquired knowledge as it learns new information.

5 min readFeb 22, 2024


This can also be referred to as model drift.


There were two papers recently published which addressed a very similar observation with Large Language Models (LLMs). This observation addresses the problem of models not only exhibiting model drift, but also performance deprecation over time.

Hence Generative Apps (Gen-Apps) and LLM-based Conversational UI’s which are based on large LLMs are at the mercy of these model changes.

Imagine curating and testing every aspect of your customer facing application in terms of user experience, design affordances and the rest; but facing this obstacle of continued changes to the foundation of the applications.

It might be that these changes are chalked up to the non-deterministic nature of LLMs; however, recent studies prove that models do change over time. And that the changes are not for the better, models to not improve, but actually faces a deprecation in performance.

Non-determinism in the context of LLMs, means that models produce different outputs for the same input.

Catastrophic Forgetting

The term Catastrophic Forgetting was coined in a recent study, and refers to the tendency of LLMs, to lose or forget previously learned information as the model is trained on new data or fine-tuned for specific tasks.

This phenomenon may occur due to the limitations of the training process, as model training usually prioritises recent data or tasks at the expense of earlier data.

As a result, the model’s representations of certain concepts or knowledge may degrade or be overwritten by newer information, leading to a loss of overall performance or accuracy on tasks that require a broad understanding of diverse topics.

This can pose challenges in scenarios where continual learning or adaptation is necessary, as the model may struggle to maintain a balanced and comprehensive understanding over time.

LLM Drift

GPT-3.5 and GPT-4 are two widely used large language model (LLM) services and updates to these models over time are not transparent.

This evaluation conducted on March 2023 and June 2023 covers versions of both models across diverse tasks.

Performance and behaviour of GPT-3.5 and GPT-4 varied significantly over time.

For example…

  • GPT-4 (March 2023) performed well in identifying prime versus composite numbers (84% accuracy), but GPT-4 (June 2023) showed poor performance (51% accuracy), attributed partly to a decline in following chain-of-thought prompting.
  • GPT-3.5 improved in June compared to March in certain tasks.
  • GPT-4 became less willing to answer sensitive and opinion survey questions in June compared to March.
  • GPT-4 performed better at multi-hop questions in June, while GPT-3.5’s performance dropped.
  • Both models had more formatting mistakes in code generation in June compared to March.
  • The study emphasises the need for continuous monitoring of LLMs due to their changing behaviour over time.

Evidence suggests GPT-4’s ability to follow user instructions decreased over time, contributing to behaviour drifts.

The table below shows Chain-Of-Thought (CoT) effectiveness drifts over time for prime testing.

Without CoT prompting, both GPT-4 and GPT-3.5 achieved relatively low accuracy.

With CoT prompting, GPT-4 in March achieved a 24.4% accuracy improvement, which dropped by -0.1% in June. It does seem like GPT-4 loss the ability to optimise the CoT prompting technique.

Considering GPT-3.5 , the CoT boost increased from 6.3% in March to 15.8% in June.

The schematic below shows the fluctuation in model accuracy over a period of four months. In some cases the deprecation is quite stark, being more than 60% loss in accuracy.

In Closing

The study on the catastrophic forgetting (CF) of LLMs during continual fine-tuning found that CF generally exists in the continual fine-tuning of different LLMs.

And with the increase of scales, models suffer a stronger forgetting in domain knowledge, reasoning, and reading comprehension.

The study also states that instruction tuning may help mitigate the CF problem.

⭐️ Follow me on LinkedIn for updates on LLMs ⭐️

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.




I explore and write about all things at the intersection of AI & language; LLMs/NLP/NLU, Chat/Voicebots, CCAI.