Steps In Evaluating Retrieval Augmented Generation (RAG) Pipelines

The basic principle of RAG is to leverage external data sources. For each user query or question, a contextual chunk of text is retrieved to inject into the prompt. This chunk of text is retrieved based on its semantic similarity with the user question. But how can a RAG implementation be tested and benchmarked over time?
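As a minimal sketch of that retrieval step (the embedding model, chunks, and question below are illustrative, not from any specific pipeline), the query and candidate chunks are embedded, the most similar chunk is selected, and it is injected into the prompt:

```python
# Minimal sketch of semantic-similarity retrieval for RAG.
# Assumes the sentence-transformers package; model and data are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Ragas evaluates RAG pipelines with LLM-assisted metrics.",
    "LangChain offers building blocks for LLM applications.",
]
question = "How can I evaluate a RAG pipeline?"

# Rank chunks by cosine similarity to the question embedding.
scores = util.cos_sim(model.encode(question), model.encode(chunks))[0]
best_chunk = chunks[int(scores.argmax())]

# Inject the retrieved chunk into the prompt sent to the LLM.
prompt = f"Context:\n{best_chunk}\n\nQuestion: {question}\nAnswer:"
```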

Cobus Greyling
5 min read · Sep 4, 2023


I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

Ragas follows a methodology of providing a “ground truth” answer for each line of test-set data. Ragas also advises that common questions be included in the test data, worded the way users would actually phrase them.

This points to one area in the LLM marketplace which is not adequately addressed: the process of:

  • Observing and inspecting user interactions in terms of input and output data.
  • Managing and designing data in an accelerated, AI-supported latent space.
  • Updating chunked data in a streamlined way to improve the responses of the LLM-based interface.

The additional cost also needs to be taken into consideration, as Ragas uses LLMs to perform the RAG evaluations.

Ragas uses LLMs under the hood to perform evaluations, but according to Ragas, it leverages LLMs in different ways to achieve the desired measurements. Ragas also uses LangChain under the hood, which enables LangSmith integration (more about this in an upcoming article).

Ragas makes use of the following columns in the required dataset:

question: list[str] - These are the questions the RAG pipeline will be evaluated on.

answer: list[str] - The answer generated by the RAG pipeline, which will be presented to the user.

contexts: list[list[str]] - The contexts passed into the LLM to answer the question.

ground_truths: list[list[str]] - The ground truth answers to the questions.

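As a minimal sketch of assembling these columns and running an evaluation, based on the Ragas quickstart at the time of writing (the sample row is invented for illustration, and Ragas calls an LLM under the hood, so an OpenAI API key must be set):

```python
# Sketch of a Ragas evaluation run; the sample row is invented.
# Requires the ragas and datasets packages plus an OPENAI_API_KEY,
# since Ragas uses an LLM to compute the metric scores.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_recall,
    context_relevancy,
    faithfulness,
)

eval_data = {
    "question": ["What does Ragas measure?"],
    "answer": ["Ragas scores faithfulness, relevancy and recall."],
    "contexts": [["Ragas evaluates RAG pipelines on four LLM-assisted metrics."]],
    "ground_truths": [[
        "Ragas measures Faithfulness, Answer Relevancy, "
        "Context Relevancy and Context Recall."
    ]],
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_relevancy, context_recall],
)
print(result)  # one aggregate score per metric
```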

Here you will find the complete working code for a Ragas implementation.

The outputs from the Ragas evaluation are four scores: Faithfulness, Answer Relevancy, Context Relevancy, and Context Recall.

Faithfulness: measures the factual accuracy of the generated answer against the provided context.

This is performed in two steps. First, given a question and a generated answer, Ragas uses an LLM to extract the statements the generated answer makes.

This gives a list of statements whose validity has to be checked. In step two, given the list of statements and the retrieved context, Ragas uses an LLM to check whether each statement is supported by the context.

The number of correct statements is summed up and divided by the total number of statements in the generated answer to obtain the score for a given example.
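Schematically, the computation reduces to the ratio below. This is a sketch of the procedure, not Ragas’s internal code; extract_statements and is_supported are hypothetical stand-ins for the two LLM calls:

```python
def faithfulness_score(question, answer, contexts,
                       extract_statements, is_supported):
    # Step 1: an LLM decomposes the generated answer into statements.
    statements = extract_statements(question, answer)
    # Step 2: an LLM checks each statement against the retrieved context.
    supported = [s for s in statements if is_supported(s, contexts)]
    # Score: supported statements over total statements in the answer.
    return len(supported) / len(statements)
```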

Answer Relevancy: measures how relevant and to the point the answer is with respect to the question. For a given generated answer, Ragas uses an LLM to generate the probable questions the answer could be answering, and computes the similarity of each to the question actually asked.
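Schematically (again a sketch, not Ragas’s internal code; generate_questions stands in for the LLM call and embed for an embedding model):

```python
import numpy as np

def answer_relevancy_score(question, answer, generate_questions, embed):
    # An LLM proposes questions the generated answer could be answering.
    candidates = generate_questions(answer)
    q = embed(question)
    sims = []
    for c in candidates:
        v = embed(c)
        # Cosine similarity between the actual and a reconstructed question.
        sims.append(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return float(np.mean(sims))
```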

Context Relevancy: measures the signal-to-noise ratio in the retrieved contexts. Given a question, Ragas uses an LLM to identify the sentences from the retrieved context that are needed to answer the question. The ratio between the required sentences and the total number of sentences in the context gives the score.
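Schematically (select_needed_sentences stands in for the LLM call that picks out the required sentences):

```python
def context_relevancy_score(question, context_sentences,
                            select_needed_sentences):
    # An LLM selects the sentences actually needed to answer the question.
    needed = select_needed_sentences(question, context_sentences)
    # Signal-to-noise: required sentences over total retrieved sentences.
    return len(needed) / len(context_sentences)
```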

Context Recall: measures the ability of the retriever to retrieve all the information needed to answer the question. Ragas calculates this by taking the provided ground_truth answer and using an LLM to check whether each statement from it can be found in the retrieved context. If a statement is not found, the retriever was unable to retrieve the information needed to support it.
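Schematically (extract_statements and attributable stand in for the LLM calls):

```python
def context_recall_score(ground_truth, contexts,
                         extract_statements, attributable):
    # An LLM decomposes the ground-truth answer into statements.
    statements = extract_statements(ground_truth)
    # An LLM checks whether each statement appears in the retrieved context.
    found = [s for s in statements if attributable(s, contexts)]
    # Score: attributable statements over total ground-truth statements.
    return len(found) / len(statements)
```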

⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️
