Steps In Evaluating Retrieval Augmented Generation (RAG) Pipelines

The basic principle of RAG is to leverage external data sources. For each user query or question, a contextual chunk of text is retrieved to inject into the prompt. This chunk of text is retrieved based on its semantic similarity with the user question. But how can a RAG implementation be tested and benchmarked over time?
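As a minimal sketch of that retrieval step (the embedding model, chunks, and question below are illustrative, not from any specific pipeline), the query and candidate chunks are embedded, the most similar chunk is selected, and it is injected into the prompt:

```python
# Minimal sketch of semantic-similarity retrieval for RAG.
# Assumes the sentence-transformers package; model and data are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Ragas evaluates RAG pipelines with LLM-assisted metrics.",
    "LangChain offers building blocks for LLM applications.",
]
question = "How can I evaluate a RAG pipeline?"

# Rank chunks by cosine similarity to the question embedding.
scores = util.cos_sim(model.encode(question), model.encode(chunks))[0]
best_chunk = chunks[int(scores.argmax())]

# Inject the retrieved chunk into the prompt sent to the LLM.
prompt = f"Context:\n{best_chunk}\n\nQuestion: {question}\nAnswer:"
```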

Cobus Greyling
5 min read · Sep 4, 2023


I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

Ragas follows a methodology of providing a “ground truth” answer for each line of test-set data. Ragas also advises that common questions be included in the test data, worded the way users would actually phrase them.

This points to one area in the LLM marketplace which is not adequately addressed: the process of:

  • Observing and inspecting user interactions in terms of input and output data.
  • Managing and designing data in an accelerated, AI-supported latent space.
  • Updating chunked data in a streamlined way to improve the responses of the LLM-based interface.

The additional cost also needs to be taken into consideration, as Ragas uses LLMs to perform the RAG evaluations.

Ragas uses LLMs under the hood to perform evaluations, but according to Ragas, it leverages LLMs in different ways to achieve the desired measurements. Ragas also uses LangChain under the hood, which enables LangSmith integration (more about this in an upcoming article).

Ragas makes use of the following columns in the required dataset:

question: list[str] - These are the questions the RAG pipeline will be evaluated on.

answer: list[str] - The answer generated by the RAG pipeline, which will be presented to the user.

contexts: list[list[str]] - The contexts passed into the LLM to answer the question.

ground_truths: list[list[str]] - The ground truth answers to the questions.

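As a minimal sketch of assembling these columns and running an evaluation, based on the Ragas quickstart at the time of writing (the sample row is invented for illustration, and Ragas calls an LLM under the hood, so an OpenAI API key must be set):

```python
# Sketch of a Ragas evaluation run; the sample row is invented.
# Requires the ragas and datasets packages plus an OPENAI_API_KEY,
# since Ragas uses an LLM to compute the metric scores.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_recall,
    context_relevancy,
    faithfulness,
)

eval_data = {
    "question": ["What does Ragas measure?"],
    "answer": ["Ragas scores faithfulness, relevancy and recall."],
    "contexts": [["Ragas evaluates RAG pipelines on four LLM-assisted metrics."]],
    "ground_truths": [[
        "Ragas measures Faithfulness, Answer Relevancy, "
        "Context Relevancy and Context Recall."
    ]],
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_relevancy, context_recall],
)
print(result)  # one aggregate score per metric
```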

Here you will find the complete working code for a Ragas implementation.

The outputs from the Ragas evaluation are four scores: Faithfulness, Answer Relevancy, Context Relevancy, and Context Recall.

Faithfulness: measures the factual accuracy of the generated answer against the provided context.

This is performed in two steps. First, given a question and a generated answer, Ragas uses an LLM to extract the statements the generated answer makes.

This gives a list of statements whose validity has to be checked. In step two, given the list of statements and the retrieved context, Ragas uses an LLM to check whether each statement is supported by the context.

The number of correct statements is summed up and divided by the total number of statements in the generated answer to obtain the score for a given example.
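Schematically, the computation reduces to the ratio below. This is a sketch of the procedure, not Ragas’s internal code; extract_statements and is_supported are hypothetical stand-ins for the two LLM calls:

```python
def faithfulness_score(question, answer, contexts,
                       extract_statements, is_supported):
    # Step 1: an LLM decomposes the generated answer into statements.
    statements = extract_statements(question, answer)
    # Step 2: an LLM checks each statement against the retrieved context.
    supported = [s for s in statements if is_supported(s, contexts)]
    # Score: supported statements over total statements in the answer.
    return len(supported) / len(statements)
```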

Answer Relevancy: measures how relevant and to the point the answer is with respect to the question. For a given generated answer, Ragas uses an LLM to generate the probable questions the answer could be answering, and computes the similarity of each to the question actually asked.
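Schematically (again a sketch, not Ragas’s internal code; generate_questions stands in for the LLM call and embed for an embedding model):

```python
import numpy as np

def answer_relevancy_score(question, answer, generate_questions, embed):
    # An LLM proposes questions the generated answer could be answering.
    candidates = generate_questions(answer)
    q = embed(question)
    sims = []
    for c in candidates:
        v = embed(c)
        # Cosine similarity between the actual and a reconstructed question.
        sims.append(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return float(np.mean(sims))
```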

Context Relevancy: measures the signal-to-noise ratio in the retrieved contexts. Given a question, Ragas uses an LLM to identify the sentences from the retrieved context that are needed to answer the question. The ratio between the required sentences and the total number of sentences in the context gives the score.
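Schematically (select_needed_sentences stands in for the LLM call that picks out the required sentences):

```python
def context_relevancy_score(question, context_sentences,
                            select_needed_sentences):
    # An LLM selects the sentences actually needed to answer the question.
    needed = select_needed_sentences(question, context_sentences)
    # Signal-to-noise: required sentences over total retrieved sentences.
    return len(needed) / len(context_sentences)
```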

Context Recall: measures the ability of the retriever to retrieve all the information needed to answer the question. Ragas calculates this by taking the provided ground_truth answer and using an LLM to check whether each statement from it can be found in the retrieved context. If a statement is not found, the retriever was unable to retrieve the information needed to support it.
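Schematically (extract_statements and attributable stand in for the LLM calls):

```python
def context_recall_score(ground_truth, contexts,
                         extract_statements, attributable):
    # An LLM decomposes the ground-truth answer into statements.
    statements = extract_statements(ground_truth)
    # An LLM checks whether each statement appears in the retrieved context.
    found = [s for s in statements if attributable(s, contexts)]
    # Score: attributable statements over total ground-truth statements.
    return len(found) / len(statements)
```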

⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️
