Meta Prompting: A Practical Guide to Optimising Prompts Automatically

Discover how meta prompting can enhance your results by using advanced models to optimise prompts for simpler ones.

Cobus Greyling

In this post, I’ll walk you through the process of refining a basic prompt to improve the quality of outputs from a language model.

I’ll use an example adapted from the OpenAI Cookbook, summarising news articles, to demonstrate how it works.

What is Meta Prompting?

Meta prompting is a technique where one Large Language Model (LLM) is used to generate or optimise prompts for another.

Typically, a more advanced model (for example, o1-preview) is employed to refine prompts for a less sophisticated model (such as GPT-4o).

The goal is to create prompts that are clearer, more structured, and better at guiding the target model to produce high-quality, relevant responses.

By leveraging the advanced reasoning capabilities of models like o1-preview, we can systematically enhance prompts to ensure they’re more effective in eliciting the desired outputs.
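In code, the pattern boils down to two chat-completion calls: one call to the stronger model that rewrites the prompt, and one call to the weaker model that uses the rewritten prompt. A minimal sketch of that shape (the helper names here are illustrative only; the full, runnable version used in this post follows later):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment; the code below uses getpass instead

def meta_prompt_rewrite(task_prompt: str) -> str:
    # Step 1: ask the stronger model to rewrite the task prompt
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": f"Improve this prompt. Return only the prompt.\n\n{task_prompt}"}],
    )
    return response.choices[0].message.content

def run_task(improved_prompt: str, article: str) -> str:
    # Step 2: use the improved prompt with the cheaper target model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": improved_prompt + "\n\n" + article}],
    )
    return response.choices[0].message.content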

Why Use Meta Prompting?

This technique simplifies the development process when working with LLMs, making it easier to achieve better results.

It’s especially useful for tasks like summarisation, question answering, or any scenario where precision matters.

How It Works

I’ll begin with a simple prompt designed to summarise news articles.

Then, using o1-preview, I’ll analyse and refine it step by step, adding clarity and detail to improve its effectiveness.

Finally, I’ll evaluate the outputs systematically to measure the impact of the changes.

The code below performs the necessary installs and imports. It will also prompt you for your OpenAI API key.

!pip install datasets openai pandas tqdm


import openai
import pandas as pd
import getpass
import matplotlib.pyplot as plt
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from pydantic import BaseModel
from datasets import load_dataset


# Initialize OpenAI client
api_key = getpass.getpass("Enter your OpenAI API key: ")
client = openai.OpenAI(api_key=api_key)

ds = load_dataset("RealTimeData/bbc_news_alltime", "2024-08")
df = pd.DataFrame(ds['train']).sample(n=100, random_state=1)
df.head()

Below is the top of the dataset that has been loaded into the notebook.
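If you want more than the first few rows, a quick inspection of the columns and article lengths is useful before prompting. The only column the rest of the code relies on is content; the other column names are simply whatever the dataset ships with:

# Quick look at the sampled data before prompting
print(df.columns.tolist())                 # available columns in the BBC news dataset
print(df['content'].str.len().describe())  # rough article lengths in characters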

First, we define a simple summarisation prompt, shown below.

To improve it, the o1-preview model is given a meta prompt containing guidance and context that describes what the refined prompt should contain.

simple_prompt = "Summarize this news article: {article}"

meta_prompt = """
Improve the following prompt to generate a more detailed summary.
Adhere to prompt engineering best practices.
Make sure the structure is clear and intuitive and contains the type of news, tags and sentiment analysis.

{simple_prompt}

Only return the prompt.
"""

def get_model_response(messages, model="o1-preview"):
    response = client.chat.completions.create(
        messages=messages,
        model=model,
    )
    return response.choices[0].message.content


complex_prompt = get_model_response([{"role": "user", "content": meta_prompt.format(simple_prompt=simple_prompt)}])
complex_prompt

And here is the enhanced prompt generated by the more advanced model:

Please read the following news article and provide a detailed summary that includes:

- **Type of News**: Categorize the article (e.g., Politics, Technology, Sports, Business, etc.).

- **Tags**: List relevant keywords or phrases associated with the article.

- **Sentiment Analysis**: Analyze the overall tone (e.g., Positive, Negative, Neutral) and provide a brief explanation.

Ensure the summary is well-structured and easy to understand.

**Article**:
{article}

Now that we have both the simple prompt and the enhanced prompt, we can run them against our dataset and compare the results.

We run both prompts against the less capable model, gpt-4o-mini.

def generate_response(prompt):
    messages = [{"role": "user", "content": prompt}]
    response = get_model_response(messages, model="gpt-4o-mini")
    return response

def generate_summaries(row):
    simple_summary = generate_response(simple_prompt.format(article=row["content"]))
    complex_summary = generate_response(complex_prompt + row["content"])
    return simple_summary, complex_summary

generate_summaries(df.iloc[0])
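If you want to inspect the two outputs side by side rather than as a raw tuple, a small print helps (purely illustrative, not part of the cookbook code):

# Illustrative: compare both summaries for the first sampled article
simple_out, complex_out = generate_summaries(df.iloc[0])
print("=== Simple prompt ===\n", simple_out)
print("\n=== Enhanced prompt ===\n", complex_out)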

Comparing the summaries from the simple and enhanced prompts reveals a noticeable improvement.

While the initial summary offers a broad overview, the enhanced version goes further, providing a detailed breakdown, categorising the news type, listing relevant tags, and even analysing sentiment.

Testing the entire dataset now…

# Add new columns to the dataframe for storing the generated summaries
df['simple_summary'] = None
df['complex_summary'] = None

# Use ThreadPoolExecutor to generate summaries concurrently
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(generate_summaries, row): index for index, row in df.iterrows()}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Generating Summaries"):
        index = futures[future]
        simple_summary, complex_summary = future.result()
        df.at[index, 'simple_summary'] = simple_summary
        df.at[index, 'complex_summary'] = complex_summary

df.head()

And the result…

Evaluating the Results

To compare the performance of the two prompts, we use a structured approach in which the language model itself acts as the judge.

This means the LLM will evaluate outputs based on specific criteria like accuracy, clarity, and relevance, providing an objective assessment without human bias. For more details, check out the OpenAI Evals cookbook.

evaluation_prompt = """
You are an expert editor tasked with evaluating the quality of a news article summary. Below is the original article and the summary to be evaluated:

**Original Article**:
{original_article}

**Summary**:
{summary}

Please evaluate the summary based on the following criteria, using a scale of 1 to 5 (1 being the lowest and 5 being the highest). Be critical in your evaluation and only give high scores for exceptional summaries:

1. **Categorization and Context**: Does the summary clearly identify the type or category of news (e.g., Politics, Technology, Sports) and provide appropriate context?
2. **Keyword and Tag Extraction**: Does the summary include relevant keywords or tags that accurately capture the main topics and themes of the article?
3. **Sentiment Analysis**: Does the summary accurately identify the overall sentiment of the article and provide a clear, well-supported explanation for this sentiment?
4. **Clarity and Structure**: Is the summary clear, well-organized, and structured in a way that makes it easy to understand the main points?
5. **Detail and Completeness**: Does the summary provide a detailed account that includes all necessary components (type of news, tags, sentiment) comprehensively?


Provide your scores and justifications for each criterion, ensuring a rigorous and detailed evaluation.
"""

class ScoreCard(BaseModel):
    justification: str
    categorization: int
    keyword_extraction: int
    sentiment_analysis: int
    clarity_structure: int
    detail_completeness: int

The evaluation is run and the results are stored in the DataFrame.

def evaluate_summaries(row):
    simple_messages = [{"role": "user", "content": evaluation_prompt.format(original_article=row["content"], summary=row['simple_summary'])}]
    complex_messages = [{"role": "user", "content": evaluation_prompt.format(original_article=row["content"], summary=row['complex_summary'])}]

    simple_summary = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=simple_messages,
        response_format=ScoreCard)
    simple_summary = simple_summary.choices[0].message.parsed

    complex_summary = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=complex_messages,
        response_format=ScoreCard)
    complex_summary = complex_summary.choices[0].message.parsed

    return simple_summary, complex_summary

# Add new columns to the dataframe for storing evaluations
df['simple_evaluation'] = None
df['complex_evaluation'] = None

# Use ThreadPoolExecutor to evaluate summaries concurrently
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(evaluate_summaries, row): index for index, row in df.iterrows()}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Evaluating Summaries"):
        index = futures[future]
        simple_evaluation, complex_evaluation = future.result()
        df.at[index, 'simple_evaluation'] = simple_evaluation
        df.at[index, 'complex_evaluation'] = complex_evaluation

df.head()

Below is the result…

And the results are graphed…

import matplotlib.pyplot as plt

df["simple_scores"] = df["simple_evaluation"].apply(lambda x: [score for key, score in x.model_dump().items() if key != 'justification'])
df["complex_scores"] = df["complex_evaluation"].apply(lambda x: [score for key, score in x.model_dump().items() if key != 'justification'])


# Calculate average scores for each criterion
criteria = [
    'Categorisation',
    'Keywords and Tags',
    'Sentiment Analysis',
    'Clarity and Structure',
    'Detail and Completeness'
]

# Calculate average scores for each criterion by model
simple_avg_scores = df['simple_scores'].apply(pd.Series).mean()
complex_avg_scores = df['complex_scores'].apply(pd.Series).mean()


# Prepare data for plotting
avg_scores_df = pd.DataFrame({
    'Criteria': criteria,
    'Original Prompt': simple_avg_scores,
    'Improved Prompt': complex_avg_scores
})

# Plotting
ax = avg_scores_df.plot(x='Criteria', kind='bar', figsize=(6, 4))
plt.ylabel('Average Score')
plt.title('Comparison of Simple vs Complex Prompt Performance by Model')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.show()

Below are the results…
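Beyond the per-criterion chart, a quick overall comparison of the two prompts can be read straight off the averages computed above (this assumes the simple_avg_scores and complex_avg_scores Series from the plotting code are still in scope):

# Overall mean score across the five criteria for each prompt variant
print("Original prompt, overall mean score:", round(simple_avg_scores.mean(), 2))
print("Improved prompt, overall mean score:", round(complex_avg_scores.mean(), 2))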

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.

