LLM Hallucination Index
Galileo released an LLM Hallucination Index, which makes for very interesting reading. The charts consider a Q&A use case, with and without RAG, as well as Long-Form Text Generation.
Introduction
Hallucination has become a catch-all term for when a model generates responses that are incorrect or fabricated. Being able to measure hallucination is the first step towards managing it.
As the video below shows, it is very interesting to see what an equaliser RAG is: the disparity in model performance is much lower when RAG is introduced than when it is absent.
What I like about this approach is the focus on the different tasks LLMs may be used for, ranging from chat to summarisation and more.
This is a practical benchmark for enterprise Generative AI teams, which need to cater for variability in task types. For instance, a model that works well for chat might not be great at text summarisation.
The study also refers to the power of context, and argues that hallucination benchmarks need to take context into consideration. Retrieval-augmented generation (RAG) has been popularised as an avenue to provide a contextual reference for LLMs at inference time.
Granted, there is nuance with regard to the quality of the context, but measuring the variability in LLM performance across RAG versus non-RAG tasks is critical.
Q&A With RAG
Retrieval-Augmented Generation (RAG) refers to a hybrid approach that combines retrieved, external information with the generative capabilities of the LLM.
In the context of large language models (LLMs) like GPT-3, retrieval-augmented generation typically involves a two-step process:
- Retrieval: The model first retrieves relevant information from a pre-existing knowledge base or external database. This retrieval step helps the model gather specific and contextually relevant information related to the given task or query.
- Generation: After retrieving relevant information, the model uses this information to generate a coherent and contextually appropriate response. This generation step allows the model to produce novel and contextually rich language based on the retrieved content.
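To make the two steps concrete, here is a minimal, illustrative sketch of the retrieve-then-generate flow. It is not the setup used in the Index; the keyword retriever and the `generate` callable are hypothetical stand-ins for whatever vector store and LLM client you actually use.

```python
from typing import Callable, List

# Toy knowledge base; in practice this would be an indexed document store.
KNOWLEDGE_BASE = [
    "Galileo's Hallucination Index evaluates LLMs on Q&A with and without RAG.",
    "RAG supplies retrieved context to the model at inference time.",
    "Long-form text generation covers reports, articles, essays and narratives.",
]

def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    # Step 1 - Retrieval: rank documents by naive keyword overlap with the query.
    # A production system would use embeddings and a vector database instead.
    query_terms = set(query.lower().split())
    return sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )[:top_k]

def answer_with_rag(query: str, generate: Callable[[str], str]) -> str:
    # Step 2 - Generation: ground the answer in the retrieved context,
    # rather than in the model's internal knowledge alone.
    context = "\n".join(retrieve(query, KNOWLEDGE_BASE))
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)  # `generate` wraps whichever LLM API you call

# Example usage with a dummy "LLM" that simply echoes its prompt:
print(answer_with_rag("What does the Hallucination Index evaluate?", generate=lambda p: p))
```

Swapping the `generate` callable between models (GPT-4-0613, GPT-3.5-turbo, Llama-2-70b-chat, and so on) while holding the retriever constant is essentially the variation a benchmark like this measures.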
The Analysis from Galileo:
OpenAI’s GPT-4-0613 performed the best and was least likely to hallucinate for Question & Answer with RAG.
While GPT-4-0613 performed the best, the faster and more affordable GPT-3.5-turbo-0613/-1106 models performed nearly identically to GPT-4-0613.
Hugging Face’s Zephyr-7b was the best-performing open-source model, outperforming Meta’s 10x larger Llama-2-70b and showing that larger models are not always better.
We found TII UAE’s Falcon-40b and MosaicML’s MPT-7b performed worst for this task type.
Recommendation: GPT-3.5-turbo-0613
Q&A Without RAG
This task entails presenting a model with a question and relying on the LLM’s internal, trained knowledge, without RAG, fine-tuning, or any reference to external sources of information.
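For contrast with the RAG sketch above, a question-only call is just the generation step with no retrieved context. Again, `generate` is a hypothetical stand-in for your LLM client, not an API from the Index itself.

```python
from typing import Callable

def answer_without_rag(query: str, generate: Callable[[str], str]) -> str:
    # No retrieval step: the model answers purely from its internal,
    # parametric knowledge, which is exactly the setting this task type evaluates.
    return generate(f"Question: {query}\nAnswer:")
```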
The Analysis from Galileo:
OpenAI’s GPT-4 performed the best and was least likely to hallucinate for Question & Answer without RAG.
OpenAI’s models ranked highest for this task type, highlighting their prowess in general knowledge use cases.
Of the open-source models in the Index, Meta’s largest model, Llama 2 (70b), performed best.
Meta’s Llama-2-7b-chat and MosaicML’s MPT-7b-instruct models performed poorly and were most likely to hallucinate for this task type.
Recommendation: GPT-4-0613
Long-Form Text Generation
Employing generative AI to craft comprehensive and cohesive textual compositions such as reports, articles, essays, or narratives relies on models trained on vast datasets. This training enables the models to grasp context, uphold subject relevance, and emulate a natural writing style across extended passages.
The Analysis from Galileo:
OpenAI’s GPT-4-0613 performed the best and was least likely to hallucinate for Long-Form Text Generation.
OpenAI’s GPT-3.5-turbo-1106 and GPT-3.5-turbo-0613 both performed on par with GPT-4, offering potential cost savings and performance improvements.
Surprisingly, Meta’s open-source Llama-2-70b-chat was on par with GPT-4, offering a cost-efficient solution for this task type.
We found TII UAE’s Falcon-40b and MosaicML’s MPT-7b performed worst for this task type.
Recommendation: Llama-2-70b-chat