The Importance Of Granular Data Design For Fine-Tuning

And leveraging Data Design to train LLMs to fully utilise context, while also solving for the Lost-In-The-Middle challenge.

Cobus Greyling
6 min read · May 2, 2024



Would conversation designers not be exceptional data designers?

This question has been lingering in the back of my mind for the last couple of days…

Allow me to explain. I have often argued that a data strategy needs to consist of four D’s: data discovery, data design, data development and data delivery.

Data delivery has been discussed at length in the context of RAG and other delivery strategies. Data discovery has also been addressed to some degree, for instance by XO Platform’s Intent Discovery. However, there is still much to do, and there are many development opportunities…

Coming to Data Design…in this article I discuss three recent studies which focus on teaching language models (both large and small) certain behaviours. The aim is not necessarily to imbue the model with specific world knowledge, but rather to improve the behaviour and abilities of the model.

These abilities can include self-correction, reasoning, improved contextual understanding over both short and long contexts, and more…

Taking A Step Back…

There has been a shift within Large Language Model research: a shift in focus towards designing training data in a way that greatly improves the reasoning capabilities of language models, especially small language models (SLMs).

This new approach can be described as not only a Data First approach to AI, but a Data Design approach.

In recent times, there has been a notable emphasis on the data delivery aspect of Language Models (LLMs and SLMs alike). Specifically, the focus has centred around how to incorporate proprietary data into the language model during inference.

The process of data delivery can be categorised into two primary approaches: gradient and non-gradient. Non-gradient methods have garnered significant attention due to their transparency, contrasting with the more opaque nature of gradient and fine-tuning techniques.

Among the non-gradient methods, RAG stands out as the most widely adopted approach to data delivery, manifesting in various iterations.

What’s intriguing is that certain fine-tuning and gradient techniques don’t primarily aim to infuse the language model with domain-specific data.

Instead, their primary goal is to alter the model’s behaviour and instruct it in specific tasks through the design and structure of the fine-tuning data. These tasks encompass functionalities like reasoning, self-correction and handling large context better.

Data Design For Fine-Tuning Data

Two recent studies in data design stand out…the first is the approach Microsoft Research followed in training a Small Language Model (SLM) called Orca-2.

Prompt Erasure

Orca-2 is an open-sourced Small Language Model (SLM) which excels at reasoning. This is achieved by decomposing a problem and solving it step-by-step, which adds to observability and explainability.

This reasoning skill was developed during fine-tuning of the SLM, by means of granular and meticulously designed training data…

Nuanced training data was created: an LLM is presented with intricate prompts which are designed to elicit strategic reasoning patterns that should yield more accurate results.

During the training phase, the smaller model is exposed to the task and the subsequent output from the LLM. The output of the LLM shows how the LLM went about solving the problem.

However, the training data was designed (changed) in such a way to hide or remove the original prompt. Hence the original prompt is not shown to the SLM.

This approach is called Prompt Erasure. It trained Orca-2 to be a cautious reasoner, because the model had to learn not only how to execute specific reasoning steps, but also how to strategise at a higher level about how to approach a particular task.
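To make the idea concrete, here is a minimal sketch of how a training example could be built with the teacher's detailed prompt erased. It assumes the erased prompt is replaced by a generic instruction; the names `TeacherSample`, `erase_prompt` and `GENERIC_SYSTEM_PROMPT` are hypothetical and are not taken from the Orca-2 work.

```python
from dataclasses import dataclass

# Hypothetical generic instruction that stands in for the erased prompt.
GENERIC_SYSTEM_PROMPT = "You are a helpful assistant."


@dataclass
class TeacherSample:
    system_prompt: str   # detailed strategy instructions shown only to the teacher LLM
    task: str            # the task given to both teacher and student
    teacher_output: str  # the teacher's step-by-step response


def erase_prompt(sample: TeacherSample,
                 generic_prompt: str = GENERIC_SYSTEM_PROMPT) -> dict:
    """Build a student training example that hides the teacher's detailed prompt."""
    return {
        "system": generic_prompt,            # the detailed prompt is not copied over
        "user": sample.task,
        "assistant": sample.teacher_output,  # the reasoning behaviour is kept
    }


sample = TeacherSample(
    system_prompt="First list the known quantities, then solve step by step.",
    task="A train travels 120 km in 2 hours. What is its average speed?",
    teacher_output="Known: distance 120 km, time 2 h. Speed = 120 / 2 = 60 km/h.",
)
example = erase_prompt(sample)
```

The student only ever sees the task and the teacher's response, so it has to infer which strategy suits the task instead of being told.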

Rather than having the SLM naively imitate powerful LLMs, the LLM is treated as a reservoir of behaviours from which a judicious selection is made for the task at hand.

Partial Answer Masking (PAM)

A recent study introduced a pipeline for designing and generating self-correction training data by proposing a method called Partial Answer Masking (PAM), with the aim of enabling the model to self-correct internally through fine-tuning.

The objective of Partial Answer Masking is to guide the language model towards self-correction.

During the fine-tuning process, we propose Partial Answer Masking (PAM) to make the model have the ability of self-verification. ~ Source

The study conducted experiments using Language Models with parameter sizes ranging from 6 billion to 13 billion across two tasks.

To enhance the self-correcting capability of small language models, the study introduces Intrinsic Self-Correction (ISC), which relies on two core capacities: self-verification and self-modification.

During the fine-tuning process, Partial Answer Masking (PAM) is introduced to instil self-verification capabilities in the model.
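As a rough illustration of what masking part of an answer can look like during fine-tuning, the sketch below excludes one span of tokens from the loss. It rests on two assumptions that are not stated in this article: that the masked part is the model's initial, possibly wrong answer, and that the training loop ignores labels set to -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`). The function name `mask_span` and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100  # labels with this value are skipped by PyTorch's CrossEntropyLoss


def mask_span(token_ids, start, end):
    """Return labels equal to token_ids, with positions [start, end) excluded from the loss."""
    if not 0 <= start <= end <= len(token_ids):
        raise ValueError("span is out of range")
    labels = list(token_ids)
    labels[start:end] = [IGNORE_INDEX] * (end - start)
    return labels


# Toy sequence: question (3 tokens), initial answer (2), verification (2), revised answer (2).
token_ids = [11, 12, 13, 21, 22, 31, 32, 41, 42]

# Assumption: the initial answer at positions 3..4 is masked, so the model is not
# trained to reproduce it, only to verify it and produce the revised answer.
labels = mask_span(token_ids, 3, 5)
print(labels)  # → [11, 12, 13, -100, -100, 31, 32, 41, 42]
```

In a typical fine-tuning setup the question tokens would often be masked as well; they are left unmasked here only to keep the example focused on the answer span.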

This marks the first demonstration that even small language models with as few as 6 billion parameters possess inherent self-correction abilities during response generation, without relying on ground truth data.

The proposed Intrinsic Self-Correction seeks to embed self-correction as a natural pattern within Language Models, involving an autonomous and spontaneous self-correction mechanism, distinct from existing prompt engineering methods.

To equip Small Language Models with self-correction capabilities, a pipeline is devised for generating self-correction data and establishing a universal data format applicable for self-correction tasks.
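A self-correction training record could be organised along the following lines. This is only a sketch: the field names, the wording of the verification sentence and the `SelfCorrectionSample` class are hypothetical and do not reproduce the format used in the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SelfCorrectionSample:
    question: str
    initial_answer: str
    is_correct: bool                       # outcome of the self-verification step
    revised_answer: Optional[str] = None   # only needed when the first answer is wrong

    def render(self) -> str:
        """Render the sample as a single training text."""
        parts = [f"Question: {self.question}", f"Answer: {self.initial_answer}"]
        if self.is_correct:
            parts.append("Verification: the answer is correct.")
        else:
            if self.revised_answer is None:
                raise ValueError("an incorrect answer needs a revised answer")
            parts.append("Verification: the answer is incorrect.")
            parts.append(f"Revised answer: {self.revised_answer}")
        return "\n".join(parts)


wrong_first = SelfCorrectionSample(
    question="What is 7 * 8?",
    initial_answer="54",
    is_correct=False,
    revised_answer="56",
)
text = wrong_first.render()
```

Using one record shape for both correct and incorrect first answers means the same pipeline can teach the model when to leave an answer alone and when to revise it.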

Training Data Design For Large Context Training

In Short

Microsoft researchers and collaborators have devised an approach to overcome the lost-in-the-middle problem.

Contemporary large language models (LLMs) can handle lengthy input but often struggle to fully utilise information within long contexts, a problem known as the lost-in-the-middle challenge.

Microsoft researchers and collaborators proposed an approach called INformation-INtensive (IN2) training to address this issue.

IN2 training uses a synthesised long-context question-answer dataset, focusing on:

  1. fine-grained information awareness in short segments within long contexts
  2. and integrating information from multiple segments.
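The first focus point can be sketched as a small data-synthesis routine: a short segment that holds the answer is placed at a random position inside a long context made of other segments. This is a simplified illustration under my own assumptions, not the pipeline from the study; `build_in2_example` and its parameters are hypothetical names.

```python
import random


def build_in2_example(key_segment, question, answer, filler_segments, n_segments, rng):
    """Place the key segment at a random position among filler segments."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    if n_segments > 1 and not filler_segments:
        raise ValueError("filler_segments must not be empty")
    fillers = [rng.choice(filler_segments) for _ in range(n_segments - 1)]
    position = rng.randrange(n_segments)  # can be the start, the middle or the end
    segments = fillers[:position] + [key_segment] + fillers[position:]
    return {
        "context": "\n\n".join(segments),
        "question": question,
        "answer": answer,
        "key_position": position,
    }


rng = random.Random(0)  # seeded so the sketch is reproducible
key_segment = "The access code for the archive room is 4172."
example = build_in2_example(
    key_segment=key_segment,
    question="What is the access code for the archive room?",
    answer="4172",
    filler_segments=["Filler paragraph about weather.", "Filler paragraph about sport."],
    n_segments=8,
    rng=rng,
)
```

Because the position is drawn uniformly, the answer is as likely to sit in the middle of the context as at either end. The second focus point could be approximated by inserting two or more key segments and asking a question that needs all of them.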

This training was applied to Mistral-7B, creating FILM-7B (FILl-in-the-Middle).

According to the study, FILM-7B shows robust retrieval of information from different positions in its 32K context window across various context styles and retrieval patterns.

It also improves performance on real-world long-context tasks, such as NarrativeQA, while maintaining performance on short-context tasks.


I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.



