RAG Evaluation

Retrieval Augmented Generation (RAG) is a popular class of LLM application. The basic principle of RAG is to give LLMs contextual reference drawn from external data sources. I have recently written a good deal about different RAG approaches and pipelines. But how can we evaluate, measure and quantify the performance of a RAG pipeline?

5 min read · Sep 1, 2023

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

Any RAG implementation has two aspects: Generation and Retrieval. The context is established via the retrieval process. Generation is performed by the LLM, which generates the answer by using the retrieved information.

When evaluating a RAG pipeline, both of these elements need to be evaluated separately and together: the combined result gives an overall score, while the individual scores pinpoint which aspects to improve.

Ragas uses LLMs to evaluate RAG pipelines, while also providing actionable metrics using as little annotated data as possible.

Ragas references the following data:

Question: The questions your RAG pipeline will be evaluated on.

Answer: The answer generated from the RAG pipeline and presented to the user.

Contexts: The contexts passed into the LLM to answer the question.

Ground Truths: The ground truth answers to the questions.
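As a rough illustration, one evaluation record could be laid out as below. The column names follow the FiQA dataset used later in this post (question, ground_truths, answer, contexts). The sample values and the check_record helper are hypothetical and not part of Ragas, and treating contexts and ground_truths as lists of strings is an assumption.

```python
# Hypothetical sample row; column names follow the dataset used later in this post.
record = {
    "question": "How is a RAG pipeline evaluated?",
    "answer": "By scoring retrieval and generation separately and together.",
    "contexts": [
        "Retrieval establishes the context.",
        "Generation produces the answer from the retrieved information.",
    ],
    "ground_truths": ["Both retrieval and generation are scored."],
}

# Assumed column types: free text for question/answer, lists for the rest.
REQUIRED_COLUMNS = {
    "question": str,
    "answer": str,
    "contexts": list,
    "ground_truths": list,
}


def check_record(row):
    """Return a list of problems found in one evaluation row (empty if none)."""
    problems = []
    for column, expected_type in REQUIRED_COLUMNS.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"{column} should be a {expected_type.__name__}")
    return problems


print(check_record(record))
```

A check like this can catch rows with a missing answer or context before any LLM calls are spent on scoring them.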

The following output is produced by Ragas:

Retrieval: context_relevancy and context_recall, which measure the performance of your retrieval system.

Generation: faithfulness, which measures hallucinations, and answer_relevancy, which measures how relevant the answers are to the questions.

The harmonic mean of these four aspects gives you the ragas score, a single measure of the performance of your QA system across all the important aspects. (Source)
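To make the harmonic mean concrete, here is a small sketch using only the Python standard library. The four values are the sample scores reported in the Ragas output later in this post; the variable names are my own and nothing here calls Ragas itself.

```python
from statistics import harmonic_mean, mean

# Sample scores reported later in this post for three FiQA rows.
scores = {
    "context_relevancy": 0.1296,
    "faithfulness": 0.8889,
    "answer_relevancy": 0.9287,
    "context_recall": 0.6370,
}

values = list(scores.values())

# Harmonic mean: n divided by the sum of reciprocals, so low scores weigh heavily.
ragas_score = harmonic_mean(values)

# Arithmetic mean, shown only for comparison.
arithmetic = mean(values)

print(f"harmonic mean:   {ragas_score:.4f}")
print(f"arithmetic mean: {arithmetic:.4f}")
```

The harmonic mean comes out at roughly 0.348, in line with the reported ragas_score, whereas the arithmetic mean would be around 0.65. A single weak component, here context_relevancy, pulls the harmonic mean down sharply, which is why it suits a combined score that should not hide a failing aspect.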

As for the data, the questions should be representative of what users actually ask.

The example below uses a dataset with the fields Index, Question, Ground Truth, Answer and Reference Context.

Here is a complete working code example you can run yourself; all you need is an OpenAI API key, as seen below.

pip install ragas
pip install tiktoken

import os
os.environ["OPENAI_API_KEY"] = "xxxxxxxxxxxxxxxxxxxxxxxxxx"

# Load the FiQA evaluation dataset from the Hugging Face Hub
from datasets import load_dataset

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval

Output:

DatasetDict({
    baseline: Dataset({
        features: ['question', 'ground_truths', 'answer', 'contexts'],
        num_rows: 30
    })
})

from ragas.metrics import (
    context_relevancy,
    answer_relevancy,
    faithfulness,
    context_recall,
)
from ragas.metrics.critique import harmfulness

from ragas import evaluate

Then run the evaluation:

# Evaluate only the first three rows of the baseline split.
result = evaluate(
    fiqa_eval["baseline"].select(range(3)),
    metrics=[
        context_relevancy,
        faithfulness,
        answer_relevancy,
        context_recall,
        harmfulness,
    ],
)

result

Ragas output:

{
    'ragas_score': 0.3482,
    'context_relevancy': 0.1296,
    'faithfulness': 0.8889,
    'answer_relevancy': 0.9287,
    'context_recall': 0.6370,
    'harmfulness': 0.0000
}
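One way to read these numbers is to rank the component scores and see which stage of the pipeline they point to. The sketch below copies the values above into a plain dict, an assumption made so that it runs without Ragas or an API key. The STAGE mapping and variable names are my own, and ragas_score and harmfulness are left out because they are not retrieval or generation components.

```python
# Component scores copied from the output above into a plain dict (no Ragas needed).
component_scores = {
    "context_relevancy": 0.1296,
    "faithfulness": 0.8889,
    "answer_relevancy": 0.9287,
    "context_recall": 0.6370,
}

# Which pipeline stage each metric belongs to, as described earlier in this post.
STAGE = {
    "context_relevancy": "retrieval",
    "context_recall": "retrieval",
    "faithfulness": "generation",
    "answer_relevancy": "generation",
}

# Sort from weakest to strongest so the first entry is the main candidate for improvement.
ranked = sorted(component_scores.items(), key=lambda item: item[1])
for name, score in ranked:
    print(f"{STAGE[name]:<10} {name:<18} {score:.4f}")

weakest_metric, weakest_score = ranked[0]
```

Here the weakest score is context_relevancy, a retrieval metric, which suggests looking at the retrieval side first.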

To view the data:

df = result.to_pandas()
df.head()

In the resulting output, each row shows the question, the ground truth text and the answer with its context, followed by the context relevancy, faithfulness, answer relevancy, context recall and harmfulness scores.

Lastly, I have a question for the community…there is obviously a need to observe, inspect and fine-tune data.

In this case, it is the data RAG accesses, which can be improved with an enhanced chunking strategy. The embedding model can also be improved, or the prompt at the heart of the RAG implementation can be optimised.

But this brings us back to the importance of data management, ideally via a data-centric latent space. Intelligently managing and updating the data used for benchmarking will become increasingly important.


HumanFirst — Design, test and launch custom NLU and prompts

HumanFirst makes sense of unstructured data quickly. Pairing human-in-the-loop and AI-powered features, seamlessly…

www.humanfirst.ai


Evaluating RAG pipelines with Ragas + LangSmith

Editor's Note: This post was written in collaboration with the Ragas team. One of the things we think and talk about a…

blog.langchain.dev

RAG — Retrieval Augmented Generation

Large Language Models, RAG and data management.

cobusgreyling.medium.com

Emerging Large Language Model (LLM) Application Architecture

Due to the highly unstructured nature of Large Language Models (LLMs), there are thought and market shifts taking place…

cobusgreyling.medium.com

GitHub - explodinggradients/ragas: Evaluation framework for your Retrieval Augmented Generation…

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines - GitHub - explodinggradients/ragas…

github.com

ragas/docs/quickstart.ipynb at main · explodinggradients/ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines - ragas/docs/quickstart.ipynb at main ·…

github.com


Written by Cobus Greyling
