LLM-Driven Synthetic Data Generation, Curation & Evaluation

I have been following the various approaches used to generate synthetic training data for language models.

5 min read · Aug 2, 2024


Three key elements emerge: the necessity of human supervision, a well-planned data topology and pipeline for training data creation, and data designed to elicit specific behaviours from the language model, such as advanced reasoning.

Introduction

When training models, balancing data quantity against data quality is a persistent challenge. Large Language Models (LLMs) offer a data-centric solution by generating synthetic data. However, a recent study argues that research in this area lacks a unified framework and remains superficial.

The study organises relevant work within a generic workflow of synthetic data generation, highlighting existing research gaps and suggesting future directions.

The goal is to guide academic and commercial communities toward more thorough investigations into LLM-driven synthetic data generation capabilities and applications.

Above: a taxonomy of LLM-driven synthetic data generation, curation and evaluation.

Tiny Stories & Phi-3

Microsoft's use of Tiny Stories to train Small Language Models (SLMs), and the way the Phi-3 models were trained, showed how strongly data design can shape the behaviour of a model and how crucial data quality is for effective model learning.

LLMs enable us to actively shape what the models learn through data manipulation, greatly improving the effectiveness and control of model training.

As of June 2024, there are over 300 datasets on Hugging Face tagged as synthetic. Many mainstream LLMs, such as Alpaca, Vicuna, OpenHermes 2.5, and OpenChat 3.5, leverage high-quality synthetic data for training.

Human Intervention

Data is essential for model intelligence and cannot be entirely generated without human oversight.

Synthetic data can introduce noise and toxic information, which may poison a model and lead to collapse.

Due to inherent biases, LLMs cannot self-correct and may deviate from intended goals. Thus, a human-friendly interactive system for annotation and verification is crucial. Currently, there is no standardised framework for human-machine collaboration in data production.

Such a system should be designed with a thorough understanding of human strengths and limitations, following human-centred principles.

Above: example prompts for data synthesis, annotation, multi-step generation and an integrated pipeline.

Key considerations include:

  • Ensuring readability and interpretability of LLM-generated information to facilitate human understanding.
  • Implementing upstream knowledge enrichment or filtering to optimise human resource use and reduce time spent on low-value tasks.
  • Adding engaging interactive features to make data processing tasks more enjoyable and attract a wider audience.
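The second consideration, upstream filtering to save reviewer time, can be sketched as a simple triage step. This is a minimal illustration, not a method taken from the study: the `confidence` field and the `0.8` threshold are assumptions, standing in for whatever quality signal a real pipeline produces.

```python
def triage(samples, threshold=0.8):
    """Split samples into auto-accepted ones and ones that need a human.

    Each sample is a dict with a hypothetical "confidence" score in [0, 1].
    """
    auto_accept, needs_review = [], []
    for sample in samples:
        if sample["confidence"] >= threshold:
            auto_accept.append(sample)
        else:
            needs_review.append(sample)
    # Show reviewers the least certain items first.
    needs_review.sort(key=lambda s: s["confidence"])
    return auto_accept, needs_review


samples = [
    {"text": "Great service, thank you.", "confidence": 0.95},
    {"text": "It was fine I guess.", "confidence": 0.55},
    {"text": "Not what I ordered.", "confidence": 0.30},
]
accepted, review_queue = triage(samples)
print(len(accepted), [s["confidence"] for s in review_queue])
```

Only the two uncertain samples reach the review queue, so human effort goes to the items where it is most likely to change the outcome.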

In traditional crowdsourced annotation, workers receive a codebook detailing the task purpose, data explanation, and background knowledge to better understand their jobs.

Similarly, for LLM-driven data generation, task specification is crucial and can include role-play, format clarification, and knowledge augmentation.
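The three elements of task specification can be combined in a single prompt template. The sketch below is illustrative only: the function name `build_task_prompt`, its parameters and the example values are assumptions, and the resulting string would still need to be sent to whichever LLM is being used.

```python
def build_task_prompt(role, task, output_format, background=None):
    """Assemble a task specification from role, background and format."""
    parts = [f"Suppose you are a {role}."]  # role-play
    if background:  # knowledge augmentation
        parts.append("Background knowledge:")
        parts.extend(f"- {fact}" for fact in background)
    parts.append(f"Task: {task}")
    parts.append(f"Respond in this format: {output_format}")  # format clarification
    return "\n".join(parts)


prompt = build_task_prompt(
    role="customer support agent",
    task="Write one customer complaint about a late delivery.",
    output_format='a JSON object with the keys "text" and "label"',
    background=["Standard delivery takes 3 to 5 working days."],
)
print(prompt)
```

Keeping the three elements as separate arguments makes it easy to vary one of them, for example the role, while holding the rest of the specification fixed.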

Context

A simple prompt like "suppose you are a {xxx}" can significantly improve LLM performance by setting the right context. This approach is reminiscent of another study, in which the researchers propose a persona-driven data synthesis method that uses different perspectives within a large language model (LLM) to create varied synthetic data.

To support this method on a large scale, they introduce Persona Hub, a collection of 1 billion diverse personas automatically gathered from web data.
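A rough sketch of the persona-driven idea: pair every persona with every task so that the same request is phrased from several perspectives. The personas and tasks below are made-up examples, not entries from Persona Hub, and `persona_prompts` is a hypothetical helper.

```python
import itertools

# Hypothetical personas and tasks, for illustration only.
PERSONAS = [
    "a high-school chemistry teacher",
    "a night-shift nurse",
    "a freight logistics planner",
]
TASKS = [
    "Write a maths word problem.",
    "Ask a question you would put to a chatbot.",
]


def persona_prompts(personas, tasks):
    """Yield one prompt per (persona, task) combination."""
    for persona, task in itertools.product(personas, tasks):
        yield {
            "persona": persona,
            "task": task,
            "prompt": f"Suppose you are {persona}. {task}",
        }


prompts = list(persona_prompts(PERSONAS, TASKS))
print(len(prompts))  # 3 personas x 2 tasks = 6 prompts
```

The number of prompts grows with the product of personas and tasks, which is why a large persona collection can yield a very large and varied synthetic dataset.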

Faithfulness

To ensure valid supervision, generated data must be logically and grammatically coherent.

However, inherent issues like hallucination and the fat-tailed knowledge distribution in large language models (LLMs) can introduce significant noise. This often leads to factual errors, incorrect labels, or irrelevant content, particularly when generating long, complex, or domain-specific data.
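Some of this noise can be caught with cheap rule-based checks before any human looks at the data. The sketch below is an assumption-laden example rather than a technique from the study: the label set, the `min_chars` threshold and the record layout are all hypothetical, and the checks only catch surface problems (too-short text, unknown labels, exact duplicates), not factual errors.

```python
ALLOWED_LABELS = {"positive", "negative"}  # hypothetical label set


def filter_samples(samples, allowed_labels=ALLOWED_LABELS, min_chars=10):
    """Return (kept, rejected), where rejected holds (sample, reason) pairs."""
    kept, rejected = [], []
    seen = set()
    for sample in samples:
        text = (sample.get("text") or "").strip()
        key = text.lower()
        if len(text) < min_chars:
            rejected.append((sample, "too short"))
        elif sample.get("label") not in allowed_labels:
            rejected.append((sample, "unknown label"))
        elif key in seen:
            rejected.append((sample, "duplicate"))
        else:
            seen.add(key)
            kept.append(sample)
    return kept, rejected


generated = [
    {"text": "The delivery arrived two days early.", "label": "positive"},
    {"text": "The delivery arrived two days early.", "label": "positive"},
    {"text": "Bad.", "label": "negative"},
    {"text": "The parcel was damaged on arrival.", "label": "angry"},
]
kept, rejected = filter_samples(generated)
print(len(kept), [reason for _, reason in rejected])
```

Recording a reason for each rejection keeps the filter auditable, which matters when the remaining samples are later handed to human reviewers.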

Diversity

Diversity refers to the variations in generated data, such as differences in text length, topic, and writing style.

It is crucial for creating synthetic samples that reflect the diversity of real-world data, which helps prevent overfitting and bias during model training or evaluation.

However, inherent biases in large language models (LLMs) often result in monotonous and less diverse content, limiting its usefulness in downstream tasks.
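One simple way to put a number on lexical diversity is the distinct-n ratio: the share of unique n-grams among all n-grams in a set of generated texts. The study is not quoted here as using this metric; it is offered as one possible measure, and whitespace tokenisation is a simplifying assumption.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenisation
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


monotonous = ["the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran"]
print(distinct_n(monotonous))  # 0.5
print(distinct_n(varied))      # 1.0
```

A score close to 1.0 means almost every n-gram is new, while a low score points to the repetitive output described above. The metric only covers wording, so topic and style diversity would need separate checks.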

Finally

The aim of synthetic data is not to imbue the target model with knowledge, but rather to train the model on certain personas and special abilities such as advanced reasoning or task decomposition.

By combining strong data discovery and data design practices within a well-structured data topology, the process of creating synthetic data becomes more efficient, accurate, and aligned with real-world needs.

This foundational layer is essential for generating high-quality synthetic data that can effectively train and validate machine learning models.


I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.


I explore and write about all things at the intersection of AI & language; LLMs/NLP/NLU, Chat/Voicebots, CCAI. www.cobusgreyling.com