Using LLMs with RAG in Chatbots: What Can Go Wrong?
Language Models are ideal for chatbots because they excel at Natural Language Generation (NLG), managing conversational context and history, and handling complex intent and entity management, solving many of the traditional pain points of chatbot development frameworks.
However, results from a fairly recent study indicate that while RAG can improve accuracy in certain scenarios, it remains vulnerable to prompts that contradict the model’s pre-trained knowledge.
This underscores the complexity of hallucinations and the need for stronger strategies to enhance LLM reliability in practical use cases.
When context was supplied along with the question prompts, the responses were 98.88% more likely to contain valid URLs.
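To make the setup concrete, here is a minimal sketch of supplying retrieved context alongside a question prompt, with an explicit instruction to stay within that context. The prompt wording, the model name and the CV excerpt are illustrative assumptions on my part, not details taken from the study.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

def answer_with_context(question: str, context: str) -> str:
    """Build a context-augmented prompt and ask the model to answer only from that context."""
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice, not the one used in the study
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example: a retrieved CV excerpt is passed in as context alongside the question.
cv_excerpt = "Education: BSc Computer Science, 2019-2022. Publications: none listed."
print(answer_with_context("What work experience does the candidate have?", cv_excerpt))
```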
Noisy Context
Noisy Context was the most frequent problem, seen in 38.1% of responses.
Here, the model would often extend information from one section of a CV (like education) into another (like work history), creating responses that mixed contexts inappropriately.
For instance, when detailing the educational background of a subject, the model might continue to list employment history, mistakenly treating the maximum response length as a goal rather than a limit.
This behaviour was considered hallucination-adjacent as it delivered credible yet inappropriate information.
With context, subjects indicated that the model correctly navigated the text to produce accurate responses approximately 94% of the time.
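One form of context management that could reduce this kind of bleed-through is to scope the supplied context to the section the question is actually about. The sketch below is a minimal illustration, assuming simple and predictable CV headings; it is not a technique described in the study, and real documents would need a far more robust parser.

```python
import re

# Illustrative CV section headings; an assumption for this sketch, not a general parser.
SECTION_PATTERN = re.compile(
    r"^(Education|Work History|Publications|Skills)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_cv_into_sections(cv_text: str) -> dict[str, str]:
    """Split a CV into named sections so only the relevant one is supplied as context."""
    sections = {}
    matches = list(SECTION_PATTERN.finditer(cv_text))
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(cv_text)
        sections[match.group(1).lower()] = cv_text[start:end].strip()
    return sections

cv = """Education
BSc Computer Science, 2019-2022

Work History
Intern, Example Corp, 2021
"""
sections = split_cv_into_sections(cv)
# A question about education is answered from the education section only,
# so there is no work-history text for the model to spill into the answer.
print(sections["education"])
```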
Mismatch Between Instruction & Context
Mismatch between Instruction and Context occurred in 19% of responses, where the model could not align the query with the available context.
For example, when asked about work experience for a student with none, the model would admit there was no relevant information in the context, sometimes even apologising for not being able to generate content.
Context-Based Synthesis
Context-based Synthesis was another issue in 19% of cases, where the model would invent or extrapolate information when context was missing or distorted.
One subject’s empty work history was filled with a plausible but entirely fabricated job list, highlighting the model’s ability to create believable yet false narratives.
Unusual Formatting
Unusual Formatting also affected 19% of responses, where non-standard CV formats led to erroneous connections between unrelated pieces of information, like job dates and titles from different entries.
This again was seen as a form of hallucination where the information seemed credible but was incorrect.
Incomplete Context
Lastly, Incomplete Context was observed in one response where a DOI was provided but not utilised for further information retrieval, leading to a response that was strictly limited to the text provided in the context.
This highlighted a common user expectation that LLMs should have capabilities akin to information retrieval systems.
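If DOI-following is an expected behaviour, it has to be engineered into the retrieval step rather than left to the model. The sketch below shows one illustrative way to resolve a DOI into text via the public Crossref REST API before it is added to the context; the endpoint and field names follow Crossref's API, but the overall flow is my assumption, not the study's setup.

```python
import requests  # assumes the requests package is installed

def doi_to_context(doi: str) -> str:
    """Resolve a DOI via the Crossref REST API and return a short text snippet
    that can be appended to the retrieved context before generation."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    work = resp.json()["message"]
    title = (work.get("title") or ["(no title)"])[0]
    venue = (work.get("container-title") or ["(no venue)"])[0]
    year = work.get("issued", {}).get("date-parts", [[None]])[0][0]
    return f'DOI {doi}: "{title}", {venue}, {year}.'

# The resolved metadata is added to the context, rather than expecting the LLM to fetch it.
print(doi_to_context("10.1038/nature14539"))  # illustrative DOI of a published paper
```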
These findings suggest that while models can effectively use context to enhance responses, there are significant challenges in ensuring the accuracy and relevance of information, particularly when dealing with complex or non-standard data structures.
In Closing
Context improves RAG system responses, but even with accurate context errors still occur, pointing to the need for advanced context management and prompt engineering to enhance system reliability.
Hallucination
Models can generate credible but false information by blending factual content, posing a significant risk that needs further research.
Formatting & Context
Unusual formatting and incomplete context can degrade the reliability of RAG responses.
User Expectations
The public has certain expectations of LLMs, highlighted by the instance where a participant expected the model to retrieve information from a DOI, suggesting a need for better user education and prompt engineering.
Chief Evangelist @ Kore.ai | I'm passionate about exploring the intersection of AI and language. From Language Models and AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.