In this blog post, we’ll explore an example of evaluation metrics, specifically the Confusion Matrix and the classification metrics derived from it.
When working with large language models (LLMs), the Confusion Matrix can be used to present and analyze experiment results. The numbers within the Confusion Matrix are used to compute classification metrics such as Accuracy, Precision, Recall, and F1 score.
The Confusion Matrix originates from the Machine Learning context; see the Resources section below for additional details.
Both the Confusion Matrix and these classification metrics help us evaluate a model’s performance when conducting experiments with LLMs.
We will discuss this along with a use case in which information is extracted to determine whether specific law paragraphs apply to a given article, for example, whether a contract complies with the law.
The key distinction from other existing examples lies in the definition of the Ground Truth. In this case, we lack precise criteria for extraction or matching and do not know what counts as correct or incorrect for an individual element. Instead, we only know how many elements in a text match and the criteria that define the law paragraphs. This approach is not ideal, because the lack of clear criteria can make the results harder to interpret.
Table of contents
- The objective of the example
- Assumptions for the Ground Truth and sample size
- Example of an experiment execution
- Experiment execution result
- Examination of the experiment execution result
- Calculation of the classifications
- Classification of the result of the example
- Example source code for the calculation in a Jupyter Notebook
- Resources
- Summary
1. The objective of the example
We want to verify how well our large language model extracts information from an article to identify which law paragraphs apply to it.
2. Assumptions for the Ground Truth and sample size
Let’s begin by establishing an assumption and defining the parameters for an experiment.
We use a nonspecific Ground Truth: we have an article and know that certain paragraphs of a law apply to it, but we do not know exactly which part of the article’s text is the reason a given paragraph of the law applies.
Concretely, we have one specific article, and we know that 12 out of the 25 paragraphs of the law apply to this article.
The image illustrates the assumptions above for the Ground Truth and the Sample size.

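These assumptions can be captured in a few lines of Python. This is only a minimal sketch: the sample size and the count of 12 applying paragraphs come from the example above, while the concrete paragraph numbers are hypothetical placeholders, since the example does not say which of the 25 paragraphs actually apply.
sample_size = 25                           # all paragraphs of the law
ground_truth_applying = set(range(1, 13))  # 12 applying paragraphs (placeholder numbers 1..12)
# Sanity checks for the assumptions of the Ground Truth and the sample size.
assert len(ground_truth_applying) == 12
assert len(ground_truth_applying) <= sample_size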
3. Example of an experiment execution
We run an experiment in which the LLM verifies, for each paragraph, whether that paragraph applies to the given article.
- Sample size
We define the 25 paragraphs as our sample size.
- Prompt
We use the article and the paragraph as variables in our prompt; the prompt is invoked once per paragraph of the law to identify whether that paragraph applies to the article.
The following text shows a very simplified prompt. It receives the paragraph and the article as variables. During an experiment, we invoke this prompt for each of the 25 paragraphs.
You must verify whether the following paragraph
{paragraph}
applies to the following article
{article}.
You must respond in the following JSON format
{ "paragraph": "STR", "applies": Boolean }
3.1 The resources of the experiment
We define that we have the following Ground Truth and sample size.
- Ground Truth
- 1 article
- 12 applying paragraphs out of 25 paragraphs, which represent our sample size.
- Sample size of 25 for the experiment execution
The image illustrates the given numbers for the resources of the experiment.

4. Experiment execution result
We didn’t conduct an experiment. We simply defined the execution result for an experiment as follows:
In an experiment using the LLM, 16 out of 25 paragraphs were identified as applying, while 9 paragraphs were identified as not applying.
- 9 not applying
- 16 applying
The image illustrates the given numbers for the experiment execution result.

5. Examination of the experiment execution result
We define the examination of the execution result as follows: we verify the result against the given Ground Truth of 12 applying paragraphs.
- 9 of the 16 paragraphs identified as applying are included in the Ground Truth of 12 applying paragraphs.
- 7 of the 16 paragraphs identified as applying are not included in the Ground Truth of 12 applying paragraphs.
The image below illustrates the intersection mapping of 9 paragraphs found in the “Experiment Found” group and the “Ground Truth.”

- 6 of the 9 paragraphs identified as not applying are not in the Ground Truth of the 12 applying paragraphs.
- 3 of the 9 paragraphs identified as not applying are found in the Ground Truth of the 12 applying paragraphs.
The image below illustrates the intersection mapping of the 3 paragraphs from the “Experiment Found” not-applying group that are nevertheless part of the “Ground Truth” and therefore should have matched.

Based on the execution result of this experiment, we define the following input values for the confusion matrix.
- Positive
- True Positive: 9 paragraphs are identified as applying and match the Ground Truth.
- False Positive: 7 paragraphs are identified as applying but are not found in the Ground Truth.
- Negative
- True Negative: 6 paragraphs are identified as not applying and are not found in the Ground Truth.
- False Negative: 3 paragraphs are identified as not applying but are found in the Ground Truth.
The resulting Confusion Matrix:
- Red marks the incorrect predictions (False Positive and False Negative), where the LLM’s prediction in the experiment execution disagrees with the Ground Truth.
- Green marks the correct predictions (True Positive and True Negative), where the LLM’s prediction in the experiment execution agrees with the Ground Truth.

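To make the mapping explicit, the sketch below derives the four values from two sets of paragraph numbers. The concrete numbers are hypothetical placeholders, chosen only so that the counts match the example: 12 paragraphs in the Ground Truth, 16 identified as applying, and an overlap of 9.
all_paragraphs = set(range(1, 26))                            # 25 paragraphs in the law
ground_truth_applying = set(range(1, 13))                     # 12 applying paragraphs (placeholder numbers)
experiment_applying = set(range(1, 10)) | set(range(13, 20))  # 16 paragraphs identified as applying

true_positive = len(experiment_applying & ground_truth_applying)                   # 9
false_positive = len(experiment_applying - ground_truth_applying)                  # 7
false_negative = len(ground_truth_applying - experiment_applying)                  # 3
true_negative = len(all_paragraphs - experiment_applying - ground_truth_applying)  # 6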
6. Calculation of the classifications
These are the definitions of the classification metrics.
1. Accuracy is the proportion of all classifications that are correct.
Calculation: Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
2. Precision is the proportion of all the model’s positive classifications that are actually positive.
Calculation: Precision = True Positive / (True Positive + False Positive)
The image below shows a helpful diagram for understanding the difference between Precision and Recall. The image is from the article Confusion Matrix (see Resources).

3. Recall is the proportion of all actual positives that are correctly classified as positive, also known as the probability of detection.
Calculation: Recall = True Positive / (True Positive + False Negative)
4. The F1 score is the harmonic mean of Precision and Recall.
Calculation: F1 score = 2 * ((Precision * Recall) / (Precision + Recall))
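Plugging the values from the Confusion Matrix above (True Positive = 9, False Positive = 7, True Negative = 6, False Negative = 3) into these formulas gives the following worked calculation:
Accuracy = (9 + 6) / (9 + 6 + 7 + 3) = 15 / 25 = 0.6
Precision = 9 / (9 + 7) = 9 / 16 = 0.5625
Recall = 9 / (9 + 3) = 9 / 12 = 0.75
F1 score = 2 * ((0.5625 * 0.75) / (0.5625 + 0.75)) = 0.84375 / 1.3125 ≈ 0.6429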
7. Classification of the result of the example
- Accuracy: 0.6
- Precision: 0.5625
- Recall: 0.75
- F1 Score: ≈ 0.6429
For this example, the experiment result is unsuitable: the accuracy of 0.6 is below the required minimum of 0.75, and the F1 score of about 0.64 falls short of the desired range of 0.8 to 0.9.
8. Example source code for the calculation in a Jupyter Notebook
This source code is an example of calculating the Confusion Matrix and the classifications without using frameworks like scikit-learn. It demonstrates manual calculation and visualization of the matrix based on the experiment execution result.
%pip install --upgrade --quiet numpy
%pip install --upgrade --quiet matplotlib
import numpy as np
import matplotlib.pyplot as plt
- Execution results of an experiment
samples = 25
true_positive = 9
false_positive = 7
true_negative = 6
false_negative = 3
# Confusion matrix with actual classes as rows and predicted classes as columns: [[TP, FN], [FP, TN]]
matrix = np.array([[true_positive, false_negative], [false_positive, true_negative]])
- Verification of the execution result and samples of the experiment
samples_calc = true_positive + false_positive + true_negative + false_negative
if samples_calc == samples:
    print(f"No missing sample: {samples_calc} calculated samples vs {samples} expected samples.")
else:
    print(f"Missing samples: {samples_calc} calculated samples vs {samples} expected samples.")
- Overview LLM experiment results
print("Overview LLM experiment results")
print("********** Relevant numbers for the Confusion Matrix *************")
print(f"- Number of samples: {samples}")
print(f"Positive")
print(f"- True positive: {true_positive} - paragraphs are identified as applying and they match with ground truth.")
print(f"- False positive: {false_positive} - paragraphs identified as not applying and not found in the ground truth.")
print(f"Negative")
print(f"- True negative: {true_negative} - paragraphs identified are applying but found in the ground truth.")
print(f"- False negative: {false_negative} - paragraphs identified as not applying but found in the ground truth.")
- Show it in a ‘Confusion Matrix’
colors = np.array([['#43994c','#994363'],['#994381','#679943']])
fig, ax = plt.subplots()
for i in range(matrix.shape[0]):
    for j in range(matrix.shape[1]):
        # Draw the cell background: green for correct predictions, red for incorrect ones.
        ax.add_patch(plt.Rectangle((j, i), 1, 1, color=colors[i, j]))
        # Write the cell value centered in the cell.
        ax.text(j + 0.5, i + 0.5, matrix[i, j], ha='center', va='center', color='white', fontsize=40)
ax.set_ylim(0,matrix.shape[0])
ax.set_xlim(0,matrix.shape[1])
column_names = ["Positive (PP)", "Negative (PN)"]  # predicted classes
row_names = ["Positive", "Negative"]               # actual classes
plt.xticks( np.arange(matrix.shape[1]) + 0.5,column_names )
plt.yticks( np.arange(matrix.shape[0]) + 0.5,row_names)
plt.gca().xaxis.set_ticks_position('top')
plt.gca().invert_yaxis()
plt.title("Confusion Matrix")
plt.show()
- Calculate the classifications
accuracy = (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative )
recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)
f1_score = 2 * ((precision* recall) /(precision + recall))
- Show the result for the classifications
print("*********** Classifications ********************")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1_score}")
9. Resources
These are the resources of this blog post that you can use to delve deeper into various areas.
- Confusion Matrix (Wikipedia)
- Classification (Google Developer)
- What is a good accuracy score in Machine Learning? (Deep Checks)
- F1 Score in Machine Learning (Encord)
- Understanding Confusion Matrix (Towards Data Science published in Medium)
- LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide (Confident AI)
- Custom LLM Metrics in watsonx.governance (Niklas Heidloff)
- What is a sample size? (Coursera)
- The confusion matrix helps assess classification model performance in machine learning by comparing predicted values against actual values for a dataset (IBM)
10. Summary
The confusion matrix is a fundamental concept in machine learning, and it is also used when working with LLMs. It’s crucial to grasp its meaning and implications, even if it initially seems a bit confusing. It’s best to double-check its meaning more than twice.
In the example, we saw that it is possible to use a broadly defined Ground Truth to interpret results and to start classifying your prompt and model performance, even though we know this is not an optimal approach.
This approach isn’t perfect, but it can be used if you lack detailed information. There are numerous other metrics related to LLMs that you can delve into. For a more in-depth exploration, consider reading the blog post ‘Custom LLM Metrics on watsonx.governance‘ by Niklas Heidloff, or ‘LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide‘ by Jeffrey Ip.
I hope this was useful to you, and let’s see what’s next!
Greetings,
Thomas
#groundtruth, #ai, #accuracy, #precision, #recall, #f1score, #confusionmatrix, #classification, #truepositive, #truenegative, #falsepositive, #falsenegative
