In this blog post, we’ll explore an example of evaluation metrics, specifically the Confusion Matrix and the classification metrics derived from it.
When working with large language models (LLMs), the Confusion Matrix can be used to present and analyze experiment results. The numbers within the Confusion Matrix are used to compute classification metrics such as Accuracy, Precision, Recall, and F1 score.
The Confusion Matrix originates from the Machine Learning context; see the Resources section below for additional details.
Both the Confusion Matrix and these classification metrics help us evaluate a model’s performance when conducting experiments with LLMs.
We will discuss this along with a use case in which information is extracted to determine whether specific law paragraphs apply to a given article, for example, whether a contract complies with the law.
The key distinction from other existing examples lies in the definition of the Ground Truth. In this case, we lack precise criteria for extraction or matching and do not know what counts as correct or incorrect for an individual element. Instead, we only know how many elements in a text match and the criteria that define the law paragraphs. This approach is not ideal, because the lack of clear criteria can make the results harder to interpret.
Table of contents
- The objective of the example
- Assumptions for the Ground Truth and sample size
- Example of an experiment execution
- Experiment execution result
- Examination of the experiment execution result
- Calculation of the classifications
- Classification of the result of the example
- Example source code for the calculation in a Jupyter Notebook
- Resources
- Summary
1. The objective of the example
We want to verify how well our large language model extracts information from an article to identify which law paragraphs apply to it.
2. Assumptions for the Ground Truth and sample size
Let’s begin by establishing an assumption and defining the parameters for an experiment.
We use a nonspecific Ground Truth: we have an article and know that certain paragraphs of a law apply to it, but we do not know exactly which part of the article’s text is the reason a given paragraph of the law applies.
Concretely, we have one specific article, and we know that 12 out of the 25 paragraphs of the law apply to this article.
The image illustrates the assumptions above for the Ground Truth and the Sample size.

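These assumptions can be captured in a few lines of Python. This is only a minimal sketch: the sample size and the count of 12 applying paragraphs come from the example above, while the concrete paragraph numbers are hypothetical placeholders, since the example does not say which of the 25 paragraphs actually apply.
sample_size = 25                           # all paragraphs of the law
ground_truth_applying = set(range(1, 13))  # 12 applying paragraphs (placeholder numbers 1..12)
# Sanity checks for the assumptions of the Ground Truth and the sample size.
assert len(ground_truth_applying) == 12
assert len(ground_truth_applying) <= sample_size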
3. Example of an experiment execution
We run an experiment in which the LLM verifies, for each paragraph, whether that paragraph applies to the given article.
- Sample size
We define the 25 paragraphs as our sample size.
- Prompt
We use the article and the paragraph as variables in our prompt; the prompt is invoked once per paragraph of the law to identify whether that paragraph applies to the article.
The following text shows a very simplified prompt. It receives the paragraph and the article as variables. During an experiment, we invoke this prompt for each of the 25 paragraphs.
You must verify whether the following paragraph
{paragraph}
applies to the following article
{article}.
You must respond in the following JSON format
{ "paragraph": "STR", "applies": Boolean }
3.1 The resources of the experiment
We define that we have the following Ground Truth and sample size.
- Ground Truth
- 1 article
- 12 applying paragraphs out of 25 paragraphs, which represent our sample size.
- Sample size of 25 for the experiment execution
The image illustrates the given numbers for the resources of the experiment.

4. Experiment execution result
We didn’t conduct an experiment. We simply defined the execution result for an experiment as follows:
In an experiment using the LLM, 16 out of 25 paragraphs were identified as applying, while 9 paragraphs were identified as not applying.
- 9 not applying
- 16 applying
The image illustrates the given numbers for the experiment execution result.

5. Examination of the experiment execution result
We define the examination of the execution result as follows: we verify the result against the given Ground Truth of 12 applying paragraphs.
- 9 of the 16 paragraphs identified as applying are included in the Ground Truth of 12 applying paragraphs.
- 7 of the 16 paragraphs identified as applying are not included in the Ground Truth of 12 applying paragraphs.
The image below illustrates the intersection mapping of 9 paragraphs found in the “Experiment Found” group and the “Ground Truth.”

- 6 of the 9 paragraphs identified as not applying are not in the Ground Truth of the 12 applying paragraphs.
- 3 of the 9 paragraphs identified as not applying are found in the Ground Truth of the 12 applying paragraphs.
The image below illustrates the intersection mapping of the 3 paragraphs from the “Experiment Found” not-applying group that are nevertheless part of the “Ground Truth” and therefore should have matched.

Based on the execution result of this experiment, we define the following input values for the confusion matrix.
- Positive
- True Positive: 9 paragraphs are identified as applying and match the Ground Truth.
- False Positive: 7 paragraphs are identified as applying but are not found in the Ground Truth.
- Negative
- True Negative: 6 paragraphs are identified as not applying and are not found in the Ground Truth.
- False Negative: 3 paragraphs are identified as not applying but are found in the Ground Truth.
The resulting Confusion Matrix:
- Red marks the incorrect predictions (False Positive and False Negative), where the LLM’s prediction in the experiment execution disagrees with the Ground Truth.
- Green marks the correct predictions (True Positive and True Negative), where the LLM’s prediction in the experiment execution agrees with the Ground Truth.

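To make the mapping explicit, the sketch below derives the four values from two sets of paragraph numbers. The concrete numbers are hypothetical placeholders, chosen only so that the counts match the example: 12 paragraphs in the Ground Truth, 16 identified as applying, and an overlap of 9.
all_paragraphs = set(range(1, 26))                            # 25 paragraphs in the law
ground_truth_applying = set(range(1, 13))                     # 12 applying paragraphs (placeholder numbers)
experiment_applying = set(range(1, 10)) | set(range(13, 20))  # 16 paragraphs identified as applying

true_positive = len(experiment_applying & ground_truth_applying)                   # 9
false_positive = len(experiment_applying - ground_truth_applying)                  # 7
false_negative = len(ground_truth_applying - experiment_applying)                  # 3
true_negative = len(all_paragraphs - experiment_applying - ground_truth_applying)  # 6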
6. Calculation of the classifications
These are the definitions of the classification metrics.
1. Accuracy is the proportion of all classifications that are correct.
Calculation: Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
2. Precision is the proportion of all the model’s positive classifications that are actually positive.
Calculation: Precision = True Positive / (True Positive + False Positive)
The image below shows a helpful diagram for understanding the difference between Precision and Recall. The image is from the article Confusion Matrix (see Resources).

3. Recall is the proportion of all actual positives that are correctly classified as positive, also known as the probability of detection.
Calculation: Recall = True Positive / (True Positive + False Negative)
4. The F1 score is the harmonic mean of Precision and Recall.
Calculation: F1 score = 2 * ((Precision * Recall) / (Precision + Recall))
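Plugging the values from the Confusion Matrix above (True Positive = 9, False Positive = 7, True Negative = 6, False Negative = 3) into these formulas gives the following worked calculation:
Accuracy = (9 + 6) / (9 + 6 + 7 + 3) = 15 / 25 = 0.6
Precision = 9 / (9 + 7) = 9 / 16 = 0.5625
Recall = 9 / (9 + 3) = 9 / 12 = 0.75
F1 score = 2 * ((0.5625 * 0.75) / (0.5625 + 0.75)) = 0.84375 / 1.3125 ≈ 0.6429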
7. Classification of the result of the example
- Accuracy: 0.6
- Precision: 0.5625
- Recall: 0.75
- F1 Score: ≈ 0.6429
For this example, the experiment result is unsuitable: the accuracy of 0.6 is below the required minimum of 0.75, and the F1 score of about 0.64 falls short of the desired range of 0.8 to 0.9.
8. Example source code for the calculation in a Jupyter Notebook
This source code is an example of calculating the Confusion Matrix and the classifications without using frameworks like scikit-learn. It demonstrates manual calculation and visualization of the matrix based on the experiment execution result.
%pip install --upgrade --quiet numpy
%pip install --upgrade --quiet matplotlib
import numpy as np
import matplotlib.pyplot as plt
- Execution results of an experiment
samples = 25
true_positive = 9
false_positive = 7
true_negative = 6
false_negative = 3
# Confusion matrix with actual classes as rows and predicted classes as columns: [[TP, FN], [FP, TN]]
matrix = np.array([[true_positive, false_negative], [false_positive, true_negative]])
- Verification of the execution result and samples of the experiment
samples_calc = true_positive + false_positive + true_negative + false_negative
if samples_calc == samples:
    print(f"No missing sample: {samples_calc} calculated samples vs {samples} expected samples.")
else:
    print(f"Missing samples: {samples_calc} calculated samples vs {samples} expected samples.")
- Overview LLM experiment results
print("Overview LLM experiment results")
print("********** Relevant numbers for the Confusion Matrix *************")
print(f"- Number of samples: {samples}")
print(f"Positive")
print(f"- True positive: {true_positive} - paragraphs are identified as applying and they match with ground truth.")
print(f"- False positive: {false_positive} - paragraphs identified as not applying and not found in the ground truth.")
print(f"Negative")
print(f"- True negative: {true_negative} - paragraphs identified are applying but found in the ground truth.")
print(f"- False negative: {false_negative} - paragraphs identified as not applying but found in the ground truth.")
- Show it in a ‘Confusion Matrix’
colors = np.array([['#43994c','#994363'],['#994381','#679943']])
fig, ax = plt.subplots()
for i in range(matrix.shape[0]):
    for j in range(matrix.shape[1]):
        # Draw the cell background: green for correct predictions, red for incorrect ones.
        ax.add_patch(plt.Rectangle((j, i), 1, 1, color=colors[i, j]))
        # Write the cell value centered in the cell.
        ax.text(j + 0.5, i + 0.5, matrix[i, j], ha='center', va='center', color='white', fontsize=40)
ax.set_ylim(0,matrix.shape[0])
ax.set_xlim(0,matrix.shape[1])
column_names = ["Positive (PP)", "Negative (PN)"]  # predicted classes
row_names = ["Positive", "Negative"]               # actual classes
plt.xticks( np.arange(matrix.shape[1]) + 0.5,column_names )
plt.yticks( np.arange(matrix.shape[0]) + 0.5,row_names)
plt.gca().xaxis.set_ticks_position('top')
plt.gca().invert_yaxis()
plt.title("Confusion Matrix")
plt.show()
- Calculate the classifications
accuracy = (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative )
recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)
f1_score = 2 * ((precision* recall) /(precision + recall))
- Show the result for the classifications
print("*********** Classifications ********************")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1_score}")
9. Resources
These are the resources of this blog post that you can use to delve deeper into various areas.
- Confusion Matrix (Wikipedia)
- Classification (Google Developer)
- What is a good accuracy score in Machine Learning? (Deep Checks)
- F1 Score in Machine Learning (Encord)
- Understanding Confusion Matrix (Towards Data Science published in Medium)
- LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide (Confident AI)
- Custom LLM Metrics in watsonx.governance (Niklas Heidloff)
- What is a sample size? (Coursera)
- The confusion matrix helps assess classification model performance in machine learning by comparing predicted values against actual values for a dataset (IBM)
10. Summary
The confusion matrix is a fundamental concept in machine learning, and it is also used when working with LLMs. It’s crucial to grasp its meaning and implications, even if it initially seems a bit confusing. It’s best to double-check its meaning more than twice.
In the example, we saw that it is possible to use a broadly defined Ground Truth to interpret results and to start classifying your prompt and model performance, even though we know this is not an optimal approach.
This approach isn’t perfect, but it can be used if you lack detailed information. There are numerous other metrics related to LLMs that you can delve into. For a more in-depth exploration, consider reading the blog post ‘Custom LLM Metrics on watsonx.governance‘ by Niklas Heidloff, or ‘LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide‘ by Jeffrey Ip.
I hope this was useful to you, and let’s see what’s next!
Greetings,
Thomas
#groundtruth, #ai, #accuracy, #precision, #recall, #f1score, #confusionmatrix, #classification, #truepositive, #truenegative, #falsepositive, #falsenegative
