A Benchmark for Verifying Chain-Of-Thought

A Chain-of-Thought is only as strong as its weakest link; a recent study from Google Research created a benchmark for Verifiers of Reasoning Chains

5 min read · Feb 7, 2024


Introduction

A significant number of studies have been performed on the topic of automated verification of LLM interactions. These methods assess LLM outputs and reasoning steps in order to evaluate and improve the correctness of responses.

This study from Google Research observes that no fine-grained, step-level datasets are available for evaluating chain-of-thought reasoning.

Hence the introduction of REVEAL, which is in essence a dataset to benchmark automatic verifiers of complex CoT for open-domain question answering.

REVEAL contains comprehensive labels for:

  1. Relevance,
  2. Attribution to evidence passages, and
  3. Logical correctness

of each reasoning step in a language model’s answer, across a wide variety of datasets and state-of-the-art language models.

The resulting dataset, named REVEAL, serves as a benchmark for evaluating the performance of automatic verifiers of LM reasoning.

The dataset is available on HuggingFace.
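For readers who want to explore the data, below is a minimal sketch of loading it with the Hugging Face `datasets` library. The dataset ID `google/reveal` is my assumption; verify the exact identifier and field names against the dataset card.

```python
from datasets import load_dataset

# Dataset ID is an assumption; see the REVEAL dataset card on Hugging Face.
reveal = load_dataset("google/reveal")

print(reveal)                     # shows the available splits and features
first_split = next(iter(reveal))  # name of the first split
print(reveal[first_split][0])     # one row, with its step-level labels
```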

Challenges

The research primarily focuses on evaluating verifiers that assess evidence attribution to a given source, rather than fact-checkers that perform evidence retrieval themselves.

Due to this focus, REVEAL can only assess verifiers that operate on specific provided evidence.

Some knowledge claims labeled as “unsupported” in the dataset may have supporting or contradictory evidence that was not surfaced by the retriever.

However, it’s important to note that the dataset aims to gather a diverse range of cases to evaluate verifiers, rather than to assess the CoT itself using an “ideal” retriever. The labels in the dataset are well-defined for the specific evidence passages used.
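To make this distinction concrete, here is a hypothetical interface sketch: a REVEAL-style verifier judges a reasoning step against evidence it is handed, and never retrieves evidence itself. All names and label values here are illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative label set; the paper's actual label names may differ.
AttributionLabel = Literal["fully_supported", "partially_supported", "unsupported"]

@dataclass
class VerifierInput:
    question: str
    step: str       # one reasoning step from the chain
    evidence: str   # the specific passage the step is judged against

def verify_attribution(inp: VerifierInput) -> AttributionLabel:
    """A real verifier would call an LLM or NLI model here; this stub only
    fixes the contract: evidence in, label out, no retrieval."""
    raise NotImplementedError("plug in a verifier model")
```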

Practical Examples

REVEAL is an evaluation benchmark for verifying reasoning chains in a Chain-of-Thought format. REVEAL checks whether a reasoning chain is a correct justification of the final answer.

It is important to note that the answer can be correct even when the reasoning is incorrect.

The image above shows four verifiers in the middle column, which verify the correctness of the Chain-of-Thought on the left. The dataset is used to benchmark multiple verifiers, with the results shown on the right.
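A minimal sketch of the "weakest link" decision rule implied by the study's title follows: a chain counts as a correct justification only if every step passes its check. The step dictionaries and their fields are illustrative, not the dataset's actual schema.

```python
# One failing step fails the whole chain, regardless of the final answer.
def chain_is_correct(steps: list[dict]) -> bool:
    return all(step["relevant"] and step["correct"] for step in steps)

# A correct final answer can still rest on a broken chain:
steps = [
    {"text": "Paris is the capital of France.", "relevant": True, "correct": True},
    {"text": "France is in South America.",     "relevant": True, "correct": False},
]
print(chain_is_correct(steps))  # False: flagged despite a correct answer
```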

Below is a flowchart of the protocol for verifying reasoning and correctness step by step.
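As a rough sketch of that protocol (my reading of the flowchart, with hypothetical checker functions standing in for real models): relevance is checked first, then the step is routed to either an attribution check against its evidence passage or a logical-correctness check.

```python
def is_relevant(step_text: str, question: str) -> bool:
    raise NotImplementedError  # e.g. an LLM or classifier call

def check_attribution(step_text: str, evidence: str) -> str:
    raise NotImplementedError  # e.g. returns "fully_supported"

def check_logic(step_text: str, prior_steps: list[str]) -> str:
    raise NotImplementedError  # e.g. returns "correct"

def verify_step(step: dict, question: str, prior_steps: list[str]) -> bool:
    if not is_relevant(step["text"], question):
        return False
    if step["step_type"] == "attribution":
        return check_attribution(step["text"], step["evidence"]) == "fully_supported"
    return check_logic(step["text"], prior_steps) == "correct"
```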

The image below shows a REVEAL instance with labels for each step: step type, relevance, and correctness.
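Schematically, such an instance might look like the following. The field names and the example question are my illustrative assumptions, not the dataset's exact schema; consult the dataset card for the real fields.

```python
instance = {
    "question": "Can a sunflower grow in Antarctica?",
    "answer": "No",
    "steps": [
        {
            "text": "Sunflowers need warm temperatures to grow.",
            "step_type": "attribution",        # a claim checked against evidence
            "relevance": "relevant",
            "attribution": "fully_supported",  # w.r.t. the given passage
        },
        {
            "text": "Antarctica is too cold, so a sunflower cannot grow there.",
            "step_type": "logical",            # an inference over prior steps
            "relevance": "relevant",
            "logical_correctness": "correct",
        },
    ],
}
```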

Annotation

As seen below, annotation was performed via a complex GUI and relied heavily on the annotators' judgement.

The annotation work is split into two tasks: one focused on logic annotation (including relevance, step type, and logical correctness ratings), and another focused on attribution annotation (including relevance and step-evidence attribution). A sketch of how the two sets of labels fit together follows.
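Here is a small sketch of how labels from the two tasks could be joined into one record per step. The field groupings mirror the split described above; the names and values are illustrative assumptions.

```python
# Labels produced by the logic task for one step:
logic_labels = {
    "relevance": "relevant",
    "step_type": "logical",
    "logical_correctness": "correct",
}

# Labels produced by the attribution task for the same step:
attribution_labels = {
    "relevance": "relevant",
    "attribution": "fully_supported",
}

# One merged, step-level record (shared keys must agree across tasks):
step_record = {**logic_labels, **attribution_labels}
print(step_record)
```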

The image below shows the annotation interface for the attribution task.

And in the image below, the annotation GUI for the logic task is shown.

Consideration

Something I find interesting from this study is the complexity involved in creating highly granular datasets.

A second consideration is the level of data design involved, given the complex data structure. The annotation GUI is also complex, and I'm quite sure it demanded a high level of expertise from the annotators.

It is evident that the process is time-consuming and demands a complex annotation framework, which contributes to a higher cost of annotation.

⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

