Phi-4 Language Model

Most Language Models are trained on organic data sourced from the web; in this sense, Phi-4 is different. It strategically incorporates synthetic data in its training process.

Dec 17, 2024


Phi-4 can be described as a Small Language Model of 14 billion parameters, trained on synthetic data generated specifically for reasoning-focused tasks.

Introduction

There has been quite a bit of criticism in the past regarding synthetic data: its quality, its repetitive nature, and the fact that structures and patterns in the generating model can carry over into the training data.

Despite efforts to ensure diversity, synthetic data can be constrained by the patterns and structures present in the generating model. This may lead to repetitive or overly generic data that doesn’t represent the full spectrum of real-world language use.

Microsoft Research has done quite a bit of work on creating recipes for producing diverse, fine-grained, high-fidelity data.

Some examples of these recipes for creating training data are TinyStories and PAM (Partial Answer Masking).

Other strategies employed by Microsoft Research include Seed Curation, Rewrite and Augment, Self-Revision, Validation of Code and Scientific Data and Instruction Reversal (More about instruction reversal later in the article).

Rewrite and Augment is where seeds are transformed into synthetic data through multi-step prompting workflows. This includes rewriting most of the useful content in given passages into exercises, discussions, or structured reasoning tasks.

The goal is to take the original content and turn it into formats that help the model learn better reasoning and interaction skills.
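To make this concrete, below is a minimal sketch of what such a multi-step prompting workflow could look like. The `complete` helper, the prompts, and the three-step breakdown are all hypothetical stand-ins for illustration, not Microsoft's actual pipeline.

```python
# Illustrative sketch of a rewrite-and-augment workflow.
# `complete` stands in for any LLM completion call; the prompts
# below are hypothetical, not Microsoft's actual ones.

def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError("wire up your own model client here")

def rewrite_and_augment(seed_passage: str) -> dict:
    # Step 1: pull the useful content out of the seed passage.
    key_points = complete(
        f"List the key facts and ideas in this passage:\n\n{seed_passage}"
    )
    # Step 2: rewrite those points into a structured reasoning task.
    exercise = complete(
        f"Turn these points into a self-contained exercise that "
        f"requires step-by-step reasoning:\n\n{key_points}"
    )
    # Step 3: pair the exercise with a worked, step-by-step solution.
    solution = complete(
        f"Solve this exercise, showing each reasoning step:\n\n{exercise}"
    )
    return {"exercise": exercise, "solution": solution}
```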

Data Design

As I mentioned before, implementing techniques like Prompt Erasure and Partial Answer Masking (PAM) has shown significant improvements in the quality and reliability of SLM outputs.
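The article does not spell out PAM's mechanics, but one plausible way to realise partial answer masking is at the loss level: hide part of the answer from the loss so the model learns to complete reasoning rather than memorise it wholesale. The sketch below assumes the common convention (used, for example, by Hugging Face Transformers) that label positions set to -100 are ignored by the cross-entropy loss; the 50% mask fraction is an arbitrary illustrative choice.

```python
# Minimal sketch of partial answer masking as loss masking.
# Assumes labels set to -100 are ignored by the cross-entropy loss,
# as in common trainer setups; mask_fraction is illustrative.

def build_labels(prompt_ids: list[int], answer_ids: list[int],
                 mask_fraction: float = 0.5) -> list[int]:
    IGNORE = -100
    n_masked = int(len(answer_ids) * mask_fraction)
    return (
        [IGNORE] * len(prompt_ids)   # never compute loss on the prompt
        + [IGNORE] * n_masked        # mask the first part of the answer
        + answer_ids[n_masked:]      # train only on the remainder
    )

# Example: loss is computed only on the last two answer tokens.
labels = build_labels(prompt_ids=[101, 102], answer_ids=[7, 8, 9, 10])
# -> [-100, -100, -100, -100, 9, 10]
```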

In recent discussions about Language Models (both Large and Small), much of the focus has been on data delivery: how to supply proprietary or contextual data to the model at inference.

This process is typically categorised into two main approaches: gradient-based methods (e.g., fine-tuning) and non-gradient approaches (e.g., retrieval-augmented generation, or RAG).

Non-gradient approaches, like RAG, have gained prominence due to their transparency and simplicity compared to the more opaque nature of fine-tuning.
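As a toy illustration of the non-gradient route, the sketch below supplies contextual documents to the model at inference time rather than baking them into the weights. The keyword-overlap retrieval is deliberately naive; real systems would use embedding similarity.

```python
# Toy RAG loop: contextual data is injected into the prompt at
# inference time instead of being embedded in the model's weights.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Naive retrieval: rank documents by keyword overlap with the query.
    q_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def rag_prompt(query: str, documents: list[str]) -> str:
    # The assembled prompt is what gets sent to the model.
    context = "\n".join(retrieve(query, documents))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"
```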

However, some gradient-based fine-tuning efforts have a different goal. Rather than embedding domain-specific data, these techniques aim to modify the model’s behaviour. Fine-tuning on structured datasets can help teach the model tasks like reasoning, self-correction, or structured workflows.

The focus in AI model training is shifting from data delivery to data design — designing datasets that train models for specific behaviours and abilities.

This approach tailors the format and structure of training data to imbue models with qualities like better reasoning, decision-making, and error correction.

Instead of just feeding the model more information, data design ensures that the training data guides the model’s performance and interactions in meaningful ways.
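As a hypothetical illustration of data design, a training record meant to teach self-correction might pair a question with a response that makes, checks, and fixes a mistake. The schema below is invented for this example; it is the shape of the behaviour, not any fact in the record, that the model is meant to learn.

```python
# Hypothetical training record designed to teach a behaviour
# (self-correction) rather than to deliver domain facts.
self_correction_example = {
    "prompt": "What is 17 * 24?",
    "response": (
        "First attempt: 17 * 24 = 398.\n"
        "Check: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.\n"
        "The first attempt was wrong; the correct answer is 408."
    ),
}
```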

Comparison Between Organic & Synthetic Data

In organic datasets, the relationships between tokens are often complex and indirect, requiring multiple reasoning steps to predict the next token, which can make learning difficult for models.

In contrast, synthetic data is generated in a way that ensures each token logically follows the previous ones, making it easier for models to learn structured reasoning patterns.

This spoon-feeding approach helps break challenges down into manageable steps, facilitating smoother learning.

Additionally, synthetic data is designed to resemble the types of outputs expected during inference, ensuring the model’s training experience aligns with real-world use cases.

For instance, facts found in web forums may not fit the style of LLM interactions, making them less accessible during chat. Converting these facts into a synthetic format similar to LLM dialogue helps the model retrieve them more effectively during inference.
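A minimal sketch of such a conversion step might look like the following; the LLM call is passed in as a parameter and the prompt wording is purely illustrative.

```python
# Sketch: rewrite a raw web-forum snippet into chat-style Q&A so the
# same facts appear in the format the model will see at inference.

def forum_to_dialogue(forum_post: str, complete) -> str:
    # `complete` is any LLM completion callable (prompt -> str).
    return complete(
        "Rewrite the factual content of this forum post as a short "
        "question-and-answer dialogue between a user and an assistant, "
        "preserving every fact:\n\n" + forum_post
    )
```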

Microsoft Research’s approach to generating synthetic data for Phi-4 follows these key principles:

Diversity: The data must thoroughly represent a wide range of subtopics and skills within each domain, achieved by curating varied seeds from organic sources.

Nuance & Complexity: Training examples should capture the intricacies and depth of the domain, including non-trivial cases, edge cases, and advanced scenarios, rather than just basic content.

Accuracy: The generated data must be reliable, ensuring that code executes properly (see the execution-check sketch after this list), proofs are valid, and explanations align with established knowledge.

Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to the problems in a step-by-step manner. This fosters coherent outputs for complex tasks.
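For the accuracy principle as applied to code, one simple execution check could look like the sketch below; it keeps only snippets that run cleanly, where a real pipeline would sandbox execution and verify outputs against test cases rather than just the exit status.

```python
# Minimal execution-based filter for synthetic code data: keep a
# generated snippet only if it actually runs. Real pipelines would
# sandbox this and check outputs, not just the return code.

import subprocess
import sys

def code_runs(snippet: str, timeout_s: float = 5.0) -> bool:
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

candidate_snippets = ["print(1 + 1)", "print(undefined_name)"]
validated = [s for s in candidate_snippets if code_runs(s)]
# Only the first snippet survives; the second raises a NameError.
```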

Benchmark results show that Phi-4 achieves high accuracy on competition-level math tasks, performing better than expected for its size and even rivalling larger models with closed weights.

Instruction Reversal

Instruction Reversal is a technique used to create synthetic data that helps models better understand and generate outputs based on instructions.

In this approach, existing code snippets are taken from a dataset and used to generate new instructions that describe the problem or task the code solves.

These new data pairs are then structured so that the instruction appears before the code, mimicking the natural flow of input-to-output interactions.

To ensure quality, only pairs where the regenerated code closely matches the original snippet are kept, maintaining a high degree of accuracy between the instruction and output.
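Putting these steps together, a sketch of the round-trip filter might look like this. The `difflib` similarity ratio and the 0.9 threshold are illustrative stand-ins for whatever matching criterion the real pipeline uses.

```python
# Sketch of the instruction-reversal loop: code -> instruction ->
# regenerated code, keeping the pair only when the round trip
# closely reproduces the original snippet.

import difflib

def reverse_instruction(snippet: str, complete, threshold: float = 0.9):
    # Step 1: generate an instruction describing the task the code solves.
    instruction = complete(
        f"Write the programming task that this code solves:\n\n{snippet}"
    )
    # Step 2: regenerate code from the instruction alone.
    regenerated = complete(f"Write code for this task:\n\n{instruction}")
    # Step 3: keep the pair only if the regenerated code closely
    # matches the original; otherwise discard it.
    similarity = difflib.SequenceMatcher(None, snippet, regenerated).ratio()
    if similarity >= threshold:
        return {"instruction": instruction, "code": snippet}
    return None
```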

This method enhances the model’s ability to follow prompts and produce relevant results and can be adapted for tasks beyond code, such as generating explanations, proofs, or other structured outputs from given instructions.
