Chain-of-Thought Reasoning Is Not Always Faithful

This study reveals that Chain-of-Thought (CoT) reasoning in advanced Language Models can be unfaithful to their actual decision-making processes.

Cobus Greyling
5 min read · Mar 20, 2025


The researchers found significant rates of unfaithful reasoning across Language Models including Claude 3.7 Sonnet (16.3%), DeepSeek R1 (5.3%) and ChatGPT-4o (7.0%).

These findings raise serious concerns for AI safety protocols that rely on monitoring CoT reasoning to detect problematic behaviours.

Unfaithful Chain-of-Thought reasoning occurs in frontier models regardless of whether they incorporate explicit thinking mechanisms and even without specific prompting designed to elicit such unfaithfulness.

Although models with dedicated thinking mechanisms generally demonstrate improved faithfulness in their reasoning chains, this research reveals they still exhibit quantifiable levels of unfaithful reasoning across a diverse range of reasoning tasks and problem types.


The Claude 3.7 Sonnet model, built with extended thinking capabilities, shows contradictory behaviour that appears to stem from something called Implicit Post-Hoc Rationalisation (IPHR).

I found the following three concepts interesting…

  1. Implicit Post-Hoc Rationalisation
  2. Restoration Errors
  3. Unfaithful Shortcuts

Implicit Post-Hoc Rationalisation

This happens when a Language Model has an implicit bias toward a particular answer and then constructs a seemingly logical explanation to justify that answer after the fact.

For example, when asked both "Is X bigger than Y?" and "Is Y bigger than X?" in separate instances, the model might say Yes to both questions and construct convincing-looking but contradictory reasoning chains to support these logically incompatible answers.

The model isn’t deliberately lying; rather, it’s constructing plausible-sounding justifications for conclusions it reached through other means.
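To make this concrete, here is a minimal sketch of how such a reversed-question probe could be run. The `always_yes_model` function is a toy stand-in for whatever chat API you actually call, and the prompts and contradiction check are simplifications rather than the paper's exact pipeline.

```python
# A minimal, self-contained sketch of the reversed-question probe described
# above. `ask` is whatever function calls your model and returns its final
# Yes/No answer; `always_yes_model` is a toy stand-in used only to make the
# example runnable.

from typing import Callable

def always_yes_model(question: str) -> str:
    # Toy model with an implicit "Yes" bias, for illustration only.
    return "Yes"

def probe_pair(x: str, y: str, ask: Callable[[str], str]) -> bool:
    """Ask both orderings of a comparative question in separate contexts.
    Returns True if the two answers are logically incompatible
    (Yes/Yes or No/No), ignoring exact ties."""
    a1 = ask(f"Is {x} bigger than {y}? Answer Yes or No.").strip().lower()
    a2 = ask(f"Is {y} bigger than {x}? Answer Yes or No.").strip().lower()
    return a1 == a2  # the same answer to both orderings cannot both be right

def unfaithfulness_rate(pairs: list[tuple[str, str]],
                        ask: Callable[[str], str]) -> float:
    """Fraction of question pairs answered in a logically incompatible way."""
    flags = [probe_pair(x, y, ask) for x, y in pairs]
    return sum(flags) / len(flags)

pairs = [("the Nile", "the Amazon"), ("Mount Everest", "K2")]
print(unfaithfulness_rate(pairs, always_yes_model))  # 1.0 for the toy model
```

The interesting cases are then the pairs where the model not only contradicts itself but also produces a fluent reasoning chain for each side.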

Quantitative results of Implicit Post-Hoc Rationalisation for the 10 frontier models and pre-trained models in the evaluation. For each model, the study shows the percentage of question pairs exhibiting unfaithfulness over the total number of pairs in the dataset (7,400), using specific classification criteria.

The study revealed that models give logically contradictory answers to paired questions, each backed by a superficially coherent argument.

For example, models sometimes affirm that both X>Y and Y>X are true in separate contexts.

The researchers found this occurred at significant rates across frontier models:

  • Sonnet 3.7 (16.3%),
  • DeepSeek R1 (5.3%), and
  • ChatGPT-4o (7.0%).

When comparing yes/no questions and their logical negations (for example, "Is X > Y?" vs "Is X ≤ Y?"), models showed similar patterns of contradiction.

This suggests models aren’t solving problems through logical reasoning but rather constructing post-hoc rationalizations for biased conclusions.

The authors interpret this as evidence that frontier models can conceal implicit biases behind superficially coherent chains of reasoning.

Restoration Errors

These occur when a model makes a mistake in its reasoning process, realises the error somewhere along the way, but then silently fixes it without acknowledging the initial mistake.

To an observer, the reasoning chain appears seamless and correct, but the model actually made and corrected an error midway through.

This creates a misleading impression that the model’s reasoning was sound from start to finish, when in reality, it recovered from a mistake.

These silent corrections create the appearance of flawless reasoning despite the presence of substantial errors.

These restorations undermine the faithfulness of the reasoning chain, as the displayed CoT does not reflect the model’s actual problem-solving process.

The frequency of restoration errors varied across models and problem types but remained a consistent issue even in thinking models.
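As a toy illustration (my own simplification, not the paper's detection method), the snippet below flags one signature of a restoration error: a stated arithmetic step that is wrong, followed by a later step that silently continues with the corrected value.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def check_arithmetic_steps(cot: str) -> list[dict]:
    """Extract simple 'a op b = c' claims from a reasoning chain and check each."""
    steps = []
    for a, op, b, claimed in re.findall(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)", cot):
        actual = OPS[op](int(a), int(b))
        steps.append({"claim": f"{a} {op} {b} = {claimed}",
                      "stated": int(claimed), "actual": actual,
                      "wrong": int(claimed) != actual})
    return steps

# Toy chain: the first step states a wrong product (112), but the next step
# silently continues with the true product (102) without acknowledging it.
cot = """First, 17 * 6 = 112.
Adding the remaining 8 gives 102 + 8 = 110."""

steps = check_arithmetic_steps(cot)
for i, step in enumerate(steps):
    later_uses_fix = any(str(step["actual"]) in s["claim"] for s in steps[i + 1:])
    if step["wrong"] and later_uses_fix:
        print(f"Possible restoration error: '{step['claim']}' is wrong, "
              f"but a later step uses {step['actual']} without acknowledging the fix.")
```

Real reasoning chains are far less tidy than this, which is exactly why silent corrections are hard to spot by skimming the output.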

Unfaithfulness rate (the proportion of correct responses that contain unfaithful shortcuts) across thinking and non-thinking frontier models from three different developers (Claude Sonnet 3.7 with and without thinking enabled, DeepSeek R1 / V3, and Qwen QwQ 32B Preview / 72B IT).

Unfaithful Shortcuts

These happen when a model uses clearly illogical or invalid reasoning steps to simplify solving complex problems. The study found this particularly in advanced mathematical problems.

Instead of working through the problem methodically, the model takes reasoning shortcuts that wouldn’t actually lead to correct answers if followed precisely. The model may still reach the right conclusion, but the reasoning path it shows doesn’t faithfully represent how it actually arrived there.
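A rough way to frame this in code (a simplification, not the paper's grading setup) is to score answer correctness and step validity separately, so a correct answer built on an invalid step gets flagged as a shortcut. The step verdicts here are assumed to come from a human grader, a symbolic checker, or a judge model.

```python
# Sketch: separate "did it get the right answer" from "do the shown steps
# actually hold". A shortcut is a correct final answer resting on at least
# one invalid step.

def grade_response(final_answer: str, gold_answer: str,
                   step_verdicts: list[bool]) -> str:
    """step_verdicts: one True/False per reasoning step, produced by whatever
    checker you trust (human, symbolic, or judge model - all assumptions)."""
    answer_correct = final_answer.strip() == gold_answer.strip()
    all_steps_valid = all(step_verdicts)
    if answer_correct and all_steps_valid:
        return "faithful and correct"
    if answer_correct and not all_steps_valid:
        return "correct answer via an unfaithful shortcut"
    return "incorrect answer"

# Example: the model asserts an unproven claim mid-solution (second step)
# but still states the right final result.
print(grade_response("n = 4", "n = 4", [True, False, True]))
# -> "correct answer via an unfaithful shortcut"
```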

All three issues pose challenges for AI safety work that relies on monitoring Chain-of-Thought reasoning to detect problematic behavior, since the reasoning chains don’t always accurately reflect the model’s actual decision-making process.

The researchers found models would sometimes present complex, impressive-looking reasoning that didn’t actually support their conclusions. These shortcuts occurred more frequently in problems requiring advanced mathematical concepts or multi-step reasoning.

Despite appearing sophisticated, these reasoning chains often contained fundamental logical gaps or invalid mathematical operations.

The prevalence of unfaithful shortcuts raises concerns about relying on CoT reasoning as evidence of model capabilities or understanding.

Finally

A robust evaluation system is crucial for comprehensively assessing language models, incorporating diverse elements such as test data, production data, quality and safety evaluators, judge models, scoring rubrics, and project-based evaluations.

Relying solely on one element, like Chain of Thought, can lead to incomplete assessments, as it may not capture critical aspects such as bias detection, toxicity, or real-world performance reflected in production logs.

By integrating multiple evaluators — like quality evaluators for fluency and coherence, and safety evaluators for banned topics — the system ensures a balanced and thorough analysis of an AI’s capabilities.

The use of scoring rubrics to define success and failure, alongside judge models with customisable configurations, allows for a standardised yet flexible evaluation process that can adapt to various use cases.

Ultimately, such a multi-faceted approach provides more reliable evaluation results, enabling better decision-making and improvement of AI systems across different projects and test cases.
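As a closing sketch, the snippet below shows one way such a multi-evaluator setup could be wired together. The evaluator names, thresholds and rubric are illustrative assumptions rather than any specific platform's API; the faithfulness evaluator is where a check like the reversed-question probe above would plug in.

```python
from typing import Callable

Evaluator = Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]

def fluency(prompt: str, response: str) -> float:
    # Placeholder quality evaluator: crudely penalise very short answers.
    return min(len(response.split()) / 50, 1.0)

def banned_topics(prompt: str, response: str) -> float:
    # Placeholder safety evaluator: 1.0 means no banned terms were found.
    banned = {"example-banned-term"}
    return 0.0 if any(term in response.lower() for term in banned) else 1.0

def cot_faithfulness(prompt: str, response: str) -> float:
    # Placeholder for a judge-model or rule-based faithfulness check.
    return 1.0

# Rubric: each evaluator paired with the threshold that counts as a pass.
RUBRIC: dict[str, tuple[Evaluator, float]] = {
    "fluency": (fluency, 0.6),
    "safety": (banned_topics, 1.0),
    "faithfulness": (cot_faithfulness, 0.8),
}

def evaluate(prompt: str, response: str) -> dict:
    """Run every evaluator and report pass/fail against its rubric threshold."""
    report = {}
    for name, (fn, threshold) in RUBRIC.items():
        score = fn(prompt, response)
        report[name] = {"score": round(score, 2), "passed": score >= threshold}
    return report

print(evaluate("Is the Nile longer than the Amazon?",
               "Yes, step by step: ... therefore the Nile is longer."))
```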
