How Should Large Language Models Be Evaluated?

A recent study focussed on the evaluation of AI models in general and LLMs in particular. This paper also serves as a comprehensive overview of benchmark studies…

5 min read · Nov 6, 2023

The study also distills the process of evaluating LLMs into three main categories: the what, the where and the how. These principles are also important when designing an LLM-based application.

Three Key Principles of LLM Applications

Granted, the aim of the paper is to consider the testing task, data and process. But the principles highlighted in the paper also serve quite well as a reference for application development.

Some organisations and products focus on one of the three elements (task, data or process flow), while others focus on the orchestration of all three.

This three-point approach also guides makers on where to focus attention and resources.

Task — The What

The what to evaluate covers the existing tasks of an LLM. This calls for a healthy design-thinking process around what we want to achieve.

Within the ambit of testing, the task to be evaluated must first be well defined: what is the purpose of the task, and how will this goal be achieved?

When teams decide to build an LLM-based application, or any generative AI application, they need to consider what problems the application must solve and what success will look like.

Effective testing and measuring are premised on knowing the what: what problem are we solving, and what does success look like? Equally important, what does failure look like?
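To make the idea of defining success and failure up front concrete, here is a minimal sketch in Python. The `TaskSpec` class, the exact-match check and the 0.75 threshold are all illustrative assumptions and do not come from the study; a real project would pick its own metric and threshold.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical description of 'the what' for a single LLM task."""
    name: str
    goal: str
    success_threshold: float  # minimum pass rate that counts as success


def passes(expected: str, actual: str) -> bool:
    # Illustrative check only: case-insensitive exact match.
    return expected.strip().lower() == actual.strip().lower()


def evaluate(spec: TaskSpec, cases: list[tuple[str, str]]) -> dict:
    """Score (expected, actual) pairs against the task's success threshold."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = sum(passes(expected, actual) for expected, actual in cases)
    rate = passed / len(cases)
    return {
        "task": spec.name,
        "pass_rate": rate,
        "success": rate >= spec.success_threshold,
    }


spec = TaskSpec(
    name="intent-detection",
    goal="Map a customer utterance to one intent label",
    success_threshold=0.75,
)
# Made-up (expected, actual) pairs standing in for model output.
cases = [
    ("refund", "Refund"),
    ("cancel", "cancel"),
    ("upgrade", "downgrade"),
    ("refund", "refund "),
]
result = evaluate(spec, cases)
print(result)
```

Writing the threshold into the task definition forces the team to agree on what failure looks like before any model output is inspected.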

One could argue that starting with the what in the product life cycle, and considering these testing benchmarks, will help create a valuable product.

Data — The Where

The where to evaluate involves selecting appropriate datasets and benchmarks for evaluation. This data also needs to be managed.

Again, this testing principle can be applied when designing and building an LLM-based application.

What data will be used, and how will this data be delivered?

When building an LLM-based application, the base knowledge the LLM was trained on can be used.

The LLM can also be fine-tuned with company- or industry-specific data, or highly contextual data can be referenced at inference by injecting it into the prompt.
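As a rough illustration of referencing contextual data at inference, the sketch below places a few context snippets into the prompt ahead of the question. The function name `build_prompt`, the prompt wording and the sample snippets are assumptions made for this example; how the snippets are retrieved is left out entirely.

```python
from __future__ import annotations


def build_prompt(question: str, context_chunks: list[str], max_chunks: int = 3) -> str:
    """Inject up to max_chunks context snippets into the prompt text."""
    selected = context_chunks[:max_chunks]
    context = "\n".join(f"- {chunk}" for chunk in selected)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up company data used purely for illustration.
chunks = [
    "Refunds take 5 days.",
    "Support is open on weekdays.",
]
prompt = build_prompt("How long does a refund take?", chunks)
print(prompt)
```

The resulting string would then be sent to whichever model the application uses; no model call is shown here.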

Process — The How

The process of evaluation can be automated, or semi-automated; the Human-In-The-Loop approach can also have varying degrees of human-level involvement.
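One way to picture a semi-automated process is a triage step: clear-cut cases are scored automatically and only the uncertain middle band goes to a human reviewer. The token-overlap score and both thresholds below are simplifying assumptions chosen for this sketch, not a method taken from the study.

```python
from __future__ import annotations


def token_overlap(reference: str, candidate: str) -> float:
    """Share of unique reference tokens that also appear in the candidate."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    if not ref:
        return 0.0
    return len(ref & cand) / len(ref)


def triage(
    pairs: list[tuple[str, str]],
    accept_at: float = 0.8,
    reject_at: float = 0.3,
) -> list[tuple[str, float]]:
    """Label each (reference, candidate) pair; unclear cases go to a human."""
    decisions = []
    for reference, candidate in pairs:
        score = token_overlap(reference, candidate)
        if score >= accept_at:
            label = "auto_pass"
        elif score <= reject_at:
            label = "auto_fail"
        else:
            label = "human_review"
        decisions.append((label, score))
    return decisions


pairs = [
    ("the cat sat", "the cat sat"),
    ("the cat sat", "a dog ran"),
    ("the cat sat down", "the cat slept"),
]
decisions = triage(pairs)
print(decisions)
```

Moving the two thresholds closer together or further apart is how the degree of human involvement would be dialled down or up in this sketch.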

The same principle applies to the process followed in implementing an application: will a prompt pipeline be used, or prompt chaining, or a level of autonomous agents and tools? There are also options for agent assist, where humans are augmented by AI.
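Prompt chaining, one of the process options mentioned above, can be sketched as a loop in which each step's output becomes the next step's input. The `run_chain` helper and the `fake_llm` stand-in are hypothetical; in a real application `fake_llm` would be replaced by a call to an actual model.

```python
from __future__ import annotations

from typing import Callable


def run_chain(llm: Callable[[str], str], templates: list[str], user_input: str) -> list[str]:
    """Run the templates in order, feeding each output into the next prompt."""
    outputs = []
    current = user_input
    for template in templates:
        prompt = template.format(input=current)
        current = llm(prompt)
        outputs.append(current)
    return outputs


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: wraps the prompt so the flow is visible.
    return f"<{prompt}>"


templates = [
    "Summarise: {input}",
    "Translate: {input}",
]
outputs = run_chain(fake_llm, templates, "hello")
print(outputs)
```

Returning every intermediate output, and not only the last one, keeps each link of the chain available for inspection and testing.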

More On LLMs

Transformers have revolutionised the efficient handling of sequential data, making it possible to capture long-range dependencies within text effectively.

One of the prominent characteristics of Large Language Models (LLMs) lies in their ability to learn in context during inference.

This entails training the model to generate text, based on a provided context or prompt.

This capacity empowers LLMs to produce more coherent and contextually relevant responses, rendering them highly suitable for interactive and conversational applications.

Reinforcement Learning from Human Feedback stands as another pivotal aspect of LLMs. This technique involves refining the model by using human-generated responses as rewards, enabling the model to learn from its mistakes and enhance its performance progressively.

In the context of the table presented below, the comparison of traditional Machine Learning, Deep Learning, and LLMs across six essential elements is very insightful.

As seen in the LLM column, interpretability is a challenge for LLMs, given their high model complexity.

As depicted in the image below, the study explores LLM evaluation in three dimensions; challenges can be considered a fourth.

Not only do these three dimensions serve as integral components of LLM evaluation, they also serve as a reference for conceptualising and planning LLM-based products.

Final Thoughts

There is not one large language model to rule them all; it seems evident that model orchestration will become increasingly important.

The study considers the current ecosystem to discover new trends and protocols and propose new challenges and opportunities.

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI and language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

Get an email whenever Cobus Greyling publishes.
