LLM Evals

What, why, when and how…



We keep seeing claims of LLMs beating every benchmark, like the recent mysterious “gpt2-chatbot” that beat all other models and turned out to be GPT-4o. You may have heard similar claims about some models outperforming others on popular benchmarks like those on the HuggingFace leaderboard, where models are evaluated across various tasks. But how exactly can we determine which LLM is superior? Isn’t it just generating words and ideas? How can we know one is better than another?

Let’s answer that. I’m Louis-François, co-founder of Towards AI and today, we dive into how we can accurately quantify and evaluate the performance of these models, understand the current methodologies used for this, and discuss why this process is vital.

Let’s get started.

Why do we Evaluate LLMs?

Evaluating LLMs is crucial to identifying potential risks, analyzing how these models interact with humans, determining their capabilities and limitations for specific tasks, and ensuring that their training progresses effectively. And, most importantly, it’s vital if you want to know if you are the best!

Sounds good, evaluation is useful. But what exactly are we assessing in an LLM?

When using an LLM, we expect two things from the model:

  1. First, it completes the assigned task, whether it is summarization, sentiment analysis, question answering or anything else LLMs can do.
  2. Second, the model must be robust and fair. This includes its performance on unexpected or previously unseen inputs, especially those that differ significantly from its training data, and adversarial inputs designed to mislead the model, like prompt injection, which we’ve discussed in a previous article. Additionally, it’s essential to check if these extensively trained LLMs carry any inherent biases and to confirm their trustworthiness and fairness.

Now that we understand what we are evaluating, let’s examine how we do it.

How do we Evaluate LLMs?

Each task we evaluate requires a benchmark tailored to that specific task. This means we need a dataset of questions and a way to score responses (our metrics), which we calculate either automatically, with a stronger model like GPT-4 as a judge, or by paying humans to do so.

We’ll start with the most widely used and affordable method for benchmarking: automated metrics and tools without human intervention. This method relies on key metrics such as accuracy and calibration.

  • Accuracy measures how much of the response is correct. Beyond raw accuracy, there are traditional metrics like the F1 score, which balances precision (how many selected items are relevant) and recall (how many relevant items are selected), used in benchmarks like SQuAD, HellaSwag, and TruthfulQA. For tasks more specific to LLM outputs, we use:
  • ROUGE, or Recall-Oriented Understudy for Gisting Evaluation: This metric is used for summarization tasks. It compares how many words or short phrases from the generated summary appear in the reference summaries. The more overlap there is, the better the summary is considered to be.
  • Levenshtein Similarity Ratio: Measures the similarity between two texts based on the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one text into the other.
  • BLEU (Bilingual Evaluation Understudy) score: Commonly used for evaluating machine translation, it calculates how many words or phrases in the machine-generated text match those in a reference translation.
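
As a rough sketch, the overlap and edit-distance ideas behind ROUGE and the Levenshtein ratio fit in a few lines of Python. In practice you would use a dedicated library (such as `rouge-score` or `nltk`); the whitespace tokenization below is a simplifying assumption.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: share of reference unigrams that appear in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of word counts
    return overlap / max(sum(ref.values()), 1)

def levenshtein_ratio(a: str, b: str) -> float:
    """1 - edit_distance / max_length, via the classic dynamic program."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return 1 - prev[n] / max(m, n, 1)

print(rouge1_recall("the cat sat on the mat", "the cat lay on the mat"))  # 5 of 6 reference words overlap
print(levenshtein_ratio("kitten", "sitting"))  # 3 edits over length 7
```

Real ROUGE implementations also report precision and F-measure and handle n-grams beyond unigrams, but the overlap-counting core is the same.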

Then, we have calibration.

  • Calibration assesses the confidence level of the model’s outputs. For example, there’s the Expected Calibration Error, which categorizes predictions by confidence. It does that by:
  1. Grouping predictions by confidence level: Imagine the model makes a bunch of predictions and says how confident it is about each one (e.g., 70% confident, 80% confident).
  2. We then check the actual accuracy in each group: For the predictions the model is 70% confident about, we need to confirm if about 70% of them are actually correct. We do the same for other confidence levels.
  3. Finally, we compare confidence to actual accuracy: If the model says it’s 70% confident and 70% of its predictions are correct, the model is well-calibrated. The Expected Calibration Error measures how close the model’s confidence is to the actual accuracy across all these groups. The smaller the expected calibration error, the better the model’s confidence matches its performance.
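
The three steps above translate almost directly into code. Here is a minimal sketch of Expected Calibration Error, assuming equal-width confidence bins (a common choice; the bin count is a free parameter):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, compare each bin's accuracy to its
    average confidence, then take the weighted average of the absolute gaps."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Well calibrated: 70% confident, and 7 of 10 predictions are correct -> ECE near 0.
print(expected_calibration_error([0.7] * 10, [True] * 7 + [False] * 3))
# Overconfident: 90% confident but only half correct -> ECE of 0.4.
print(expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5))
```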

Benchmarks like those on the HuggingFace leaderboard often include calibration metrics alongside accuracy. For instance, MMLU, probably the most popular benchmark, evaluates models on a diverse set of 57 subjects, including elementary mathematics, US history, computer science, and law. It uses multiple-choice questions to test a model’s ability to understand and reason across these varied domains, and the scores are calculated automatically against stored answers.
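
As a sketch of how a multiple-choice benchmark like MMLU can be scored automatically, here is a minimal accuracy computation against a stored answer key (the question IDs and letters below are made up for illustration):

```python
def score_multiple_choice(predictions: dict, answer_key: dict) -> float:
    """Accuracy over multiple-choice items: predicted letter vs. stored answer."""
    correct = sum(predictions.get(qid) == gold for qid, gold in answer_key.items())
    return correct / len(answer_key)

# Hypothetical items in the style of MMLU's A/B/C/D format.
answer_key = {"hs_math_001": "C", "us_history_042": "A", "law_017": "D"}
predictions = {"hs_math_001": "C", "us_history_042": "B", "law_017": "D"}

print(score_multiple_choice(predictions, answer_key))  # 2 of 3 correct
```

The hard part in real harnesses is not this division but reliably extracting a single letter from a free-form model response, which is one reason multiple-choice formats are popular for automated evaluation.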

Although automated benchmarking offers an efficient, direct and standardized testing approach, it might overlook the nuances and qualitative aspects of human evaluators’ outputs. Ideally, we also want humans in the loop. However, using humans is quite expensive, so an alternative is to use Models as Judges.

There are two primary approaches for grading: using a super powerful general model like GPT-4 or small specialist models trained on preference data.

Models like GPT-4 provide results that correlate well with human preferences, but they are often closed source, subject to API changes, and lack interpretability, which is not ideal for consistent evaluations in a benchmark.

Smaller, specialized models as judges (like sentiment classifiers) might reduce this risk due to their focused training, which can make their evaluations more consistent and interpretable, especially because you own them. But they also need rigorous testing themselves, as they are less versatile and powerful than a large model like GPT-4.

Regardless of the approach, this method of using models as evaluators has limitations, such as inconsistent scoring and a tendency for models to favour their own outputs. Recently, a new method called G-Eval has been introduced. It uses LLMs in a unique way, combining “Chain of Thought” prompting with a form-filling technique.

For example, suppose we want to evaluate how well a model summarizes a piece of text. Instead of just comparing the generated summary to a reference summary with the ROUGE metric, G-Eval asks the judge model to explain its reasoning about each part of the summary. For instance, it might prompt, “Why did you include this specific detail?” The model provides its reasoning and then fills out a form evaluating the summary against predetermined criteria. This dual approach yields a more nuanced, human-like evaluation.
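
To make this concrete, here is a hedged sketch of what a G-Eval-style prompt and score parser could look like. The criteria names follow the summarization setup described in the G-Eval paper, but the template wording and helper functions are illustrative assumptions, and the actual call to the judge model is left out.

```python
GEVAL_TEMPLATE = """You will evaluate a summary against its source text.

First, reason step by step about how well the summary covers the source
(chain of thought). Then fill in the form, scoring each criterion 1-5.

Source:
{source}

Summary:
{summary}

Form:
Coherence: <score>
Consistency: <score>
Relevance: <score>
Fluency: <score>
"""

def build_geval_prompt(source: str, summary: str) -> str:
    """Assemble the evaluation prompt sent to the judge model."""
    return GEVAL_TEMPLATE.format(source=source, summary=summary)

def parse_scores(model_output: str) -> dict:
    """Pull 'Criterion: N' lines out of the judge model's filled-in form."""
    scores = {}
    for line in model_output.splitlines():
        name, _, value = line.partition(":")
        if value.strip().isdigit():
            scores[name.strip()] = int(value.strip())
    return scores

# The judge model's reply would be parsed like this (hypothetical output):
print(parse_scores("Reasoning: ...\nCoherence: 4\nConsistency: 5"))
```

The original G-Eval paper additionally weights scores by the judge model's token probabilities; this sketch only shows the prompt-and-form skeleton.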

This method is becoming more popular because it aligns more closely with human reasoning, offering a balanced view that incorporates both the mechanical accuracy of automated metrics and the qualitative insights of human evaluations. And speaking of humans, there’s pretty much nothing better than that.

In the last approach, where Humans Judge LLMs, the evaluation focuses on how well people think the model’s results measure up in terms of quality and accuracy. Evaluators look at the outputs and consider how clear and relevant they are and how smoothly they read.

It can work in a few different ways…

  1. One of them is Vibes-Check, where community members try out different models by giving them specific prompts to see how they respond. They might share their overall impressions or do a deeper, more detailed review, and even share their findings publicly. However, as shared in a great blog post by Clémentine Fourrier on HuggingFace, this approach is very susceptible to confirmation bias, where evaluators tend to find what they are looking for.
  2. Another one is Community Arenas, which let people vote and give feedback on various models, offering a wide range of opinions and insights. People from the community simply chat with the models until they decide that one is better than the others; it’s quite subjective. The votes are then compiled dynamically into a leaderboard using an Elo-style ranking system. This is where the recent GPT-4o, hiding under the name “gpt2-chatbot”, made a lot of noise in the LMSYS Chatbot Arena.
  3. Finally, we have the most obvious one, which we call Systematic Annotations. This involves (usually paid) reviewers following strict rules to avoid bias and keep their evaluations consistent. While this can be very thorough, it can also be quite costly; bigger companies like Scale AI do this.
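
The arena-style voting above boils down to pairwise comparisons, which can be turned into ratings with a standard Elo update. This is a sketch of the idea only; the actual leaderboard computation used by arenas like LMSYS is more involved.

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """One Elo update for a single pairwise vote: the winner gains rating
    in proportion to how unexpected the win was."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

ratings = {"model-a": 1000.0, "model-b": 1000.0}
# Hypothetical stream of votes: True means model-a won the comparison.
for a_wins in [True, True, False, True]:
    ratings["model-a"], ratings["model-b"] = elo_update(
        ratings["model-a"], ratings["model-b"], a_wins
    )
print(ratings)  # model-a ends up rated above model-b after winning 3 of 4 votes
```

With equal starting ratings, a single win moves each rating by k/2 = 16 points; upsets against higher-rated models move ratings more.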

But even better, you can blend both approaches. Tools like Dynabench combine human reviewers and AI models, continually refreshing the data the models are tested on so it stays relevant and challenging, and they usually produce very high-quality benchmarks.

Human evaluators are crucial because they offer a unique perspective that automatic checks can’t match. They don’t just look at whether the answer is right; they also consider whether it makes sense in context, reads well, is clear and safe, and seems in line with what humans would generally agree with or value.

The success of these human evaluations can depend on several factors, such as the number of evaluators, their expertise in the area, and how well they know the task. Still, they are susceptible to biases. But so are models. We cannot avoid that entirely, but we can mitigate it with clear instructions and scale.

These evaluation methods mix online and offline approaches, where online evaluations interact with live data and capture real-time user feedback, and offline evaluations test LLMs against specific datasets. Both are crucial to have a good understanding of LLMs’ capabilities over time with solid baselines and flexible evolving ones.

In the end, each style has limitations. Evaluations that rely on prompts can be unpredictable because LLMs react differently to different prompts. On the other hand, the evaluators, whether they are humans or AIs, have personal biases and various ways of interpreting the tone, which can make the results inconsistent.

While the current evaluation systems are detailed, they might not fully capture LLMs’ true abilities and risks in this fast-evolving field. As we continue to explore and expand LLMs’ possibilities, it’s essential to keep improving how we evaluate them as well, and I’m quite excited to see how far this can get. It’s already quite incredible what we can do almost automatically with very little cost.

If you’re interested in learning more about how LLMs are used in real-world applications and their broader impact, be sure to check out our new book, Building LLMs for Production, where we discuss this crucial step in depth with practical examples.

Thank you for reading, and I will see you in the next one!

Resources mentioned in this article:

  1. https://huggingface.co/blog/clefourrier/llm-evaluation
  2. https://arxiv.org/pdf/2307.03109
  3. https://klu.ai/glossary/llm-evaluation
  4. https://www.confident-ai.com/blog/how-to-evaluate-llm-applications
  5. https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/working-with-llms/eval-metrics
  6. https://www.turing.com/resources/understanding-llm-evaluation-and-benchmarks#a.-key-evaluation-metrics