
Understanding and Remembering Precision and Recall

Kadam Parikh · Published in The Startup · Jul 30, 2020 · 13 min read

Hello folks, greetings. So, maybe you are thinking: what’s so hard about precision and recall? Why yet another article on this topic?

I recommend reading this article with patience and a notepad and pencil in hand. Also, concentrate… Reread the same lines if needed.

I have a hard time remembering things. I tend to forget things that I haven’t used for a while. I tend to forget the FORMULAS of Precision and Recall over time.

BUT, I have a tendency to reconstruct things in my mind. In high school, I had a hard time cramming. I couldn’t remember formulas for long. So, what I did was understand them in natural language (for example, English). Then, during my exams, I would simply recreate the formula from my understanding. This ability also allowed me, at times, to invent new formulas. It wasn’t really invention, just specialization. But hey, I was a kid at the time, right!! So, let’s keep calling it “invention” ;)

Now, you might be thinking, “I am not here to hear your story”. But I am here to make you hear my story XD. Just kidding! Let’s start…

So, let’s understand Precision and Recall in an intuitive manner. Then you won’t need to Google what they mean and how they are formulated every time.

You might already be aware of the terms TP, FP, TN and FN. But I have a habit of explaining thoroughly. So feel free to skip that section if you already know them.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

TP, FP, TN and FN

Assume that you are performing a classification task. Let us keep it very simple. Suppose you are performing single-label image classification. This means that the image belongs to one and only one of the given classes. Also, let’s make it even simpler: consider that there is only one class of interest.

Now, if you don’t know the difference between single-label and multi-label classification, just Google a bit.

So, you are now performing binary image classification. For example, the task of deciding whether an image contains a dog or not belongs to this category.

So, there are two target labels, mapped to 1 and 0: dog and not dog. Consider “dog” as the “positive” class (1) and “not dog” as the “negative” class (0). In short, define positive as one of the two classes and negative as the other (leftover) class.

Now, you feed an image to the model and the model predicts that the image is of a dog. This means the model is “positive” that there is a dog. But suppose the image isn’t actually of a dog; it is of a person. Hence, the output of the model is wrong. Wrong means “false”. This is an example of a false positive.

Suppose, instead, that the image actually contained a dog. Then the model was correct. Correct means “true”. This becomes an example of a true positive.

So, true positive means that the model is positive and is correct. And false positive means that the model is positive but is wrong/incorrect.

Same goes for true negative and false negative. If the model predicts that there is no dog (i.e. negative) but there actually is a dog, then the model is wrong. This becomes a case of false negative. Similarly, if the model predicted that there is no dog and the image actually doesn’t contain a dog, then the model is correct. This is a case of true negative.

So, now you have an idea of these terms. Let’s extend this to the whole dataset instead of a single image. Suppose you are classifying 100 images. The model classified 70 images correctly and 30 incorrectly. Kudos! You now have a 70% accurate model.

Now, let’s focus on the correct classifications, i.e. TRUE ones. Suppose 20 of the 70 correctly classified images did not contain a dog, i.e. they were NEGATIVES. In this case, the value of TRUE NEGATIVES is 20. And hence, the value of TRUE POSITIVES is 50.

Now, consider the incorrectly classified images, i.e. FALSE classifications. Suppose the model predicted “dog” (i.e. POSITIVE) for 10 of these 30 images; those predictions are wrong, so the value of FALSE POSITIVES is 10. The remaining 20 images were wrongly predicted as “not dog”, so the value of FALSE NEGATIVES becomes 20.

Now, let’s add up. TP + FP + TN + FN = 50 + 10 + 20 + 20 = 100 = size of the dataset.

Remember: Positive/Negative refers to the prediction made by the model. And True/False refers to the evaluation of that prediction i.e. if the prediction made is correct (true) or incorrect (false).
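Here is a minimal sketch in plain Python (with hypothetical toy labels, not the 100-image example above) of how these four counts can be computed by comparing predictions against ground truth:

```python
# Toy data: 1 = "dog" (positive), 0 = "not dog" (negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (hypothetical)

tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)  # predicted dog, correct
fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)  # predicted dog, wrong
tn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 0)  # predicted not-dog, correct
fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)  # predicted not-dog, wrong

print(tp, fp, tn, fn)                      # 3 1 3 1
assert tp + fp + tn + fn == len(y_true)    # the four counts always cover the whole dataset
```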

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —-

So, now that you have understood these terms, let’s move on to precision and recall. If you have read other articles in the past, you might be wondering: what about the confusion matrix? Am I going to skip it? Maybe yes?! Maybe not! See, confusion matrices are, well, too confusing. The only reason they are needed, or the only reason they are included in precision-recall articles, is that they help with the formulation of precision and recall.

And as I said earlier, I am too bad at remembering formulas. So, let’s just invent (create) them.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —-

Introduction

What does precision mean to you? Actually, the term precision is context dependent. It depends on the task you are performing. Whether you are solving a math problem, performing image classification, or performing object detection, the term precision has a different meaning in each context. The current object detection metrics build further on top of it: they still use the same formula and then apply additional calculations on precision and recall. No comments on that part. I ain’t a researcher, and hence I can’t comment on how they calculate their metrics.

For those who don’t know what I mean by metrics, go Google it…

So, for now, let’s understand the meaning of precision in its most commonly used formulation. Precision just measures how precise your model is. Above, I mentioned that your model is 70% accurate. But can you answer how precise it was? No…

Accuracy, here, means the percentage of images correctly classified by the model. So, what does precision mean?

The thing is, as I said, the concept of precision is context dependent. But you are lucky: for evaluating ML models, the concept remains the same throughout. To understand precision in an intuitive manner, though, you will first need to understand why you need precision.

You might have read several articles online about what precision and recall are. But none of them clearly mentions why you need them. Yes, there are separate articles covering that topic; I will share a reference at the end. But let me try to explain here why you need precision and recall. I believe in completeness ;)

Actually, the reference that I am going to share is quite good. It also lists the formulas of precision and recall. But does it make you understand the formulas? No… it just states the formula and you will have to cram it. So, stay focused here XD

Why do you need recall?

“Are you serious? What about Precision? You just skipped everything related to precision and jumped directly to recall…” Yeah, I hear you. But just wait and watch -_-.

The real question is: “I have accuracy. My model is 99% accurate. Why do I still need recall?” Now, this depends on the task you perform. If you are classifying whether an image is of a dog (positive class) or not (negative class), then accuracy is all you need. But if you are classifying whether a person is infected by COVID-19 or not, then you will need something more than accuracy. Let’s understand this with an example.

Suppose you have 100 images to classify and the task is to predict whether each is a dog or not. Now, the model classified 9 images as positive and 91 images as negative.

Suppose, the values of TP, FP, TN and FN are 9, 0, 90, 1 respectively.

Note that TP + FP = Positives = 9 and TN + FN = Negatives = 91.

That means the model correctly classified 99 images out of 100. Note that correct implies true, and the true predictions = TP + TN = 9 + 90 = 99. That is, 99% accuracy.
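As a quick sanity check, here is how that 99% falls out of the four counts (a tiny sketch with the numbers assumed above):

```python
# Counts assumed in the example above: TP=9, FP=0, TN=90, FN=1.
tp, fp, tn, fn = 9, 0, 90, 1
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)   # 0.99 -> 99% accurate, yet one actual positive was missed (the FN)
```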

Here, the model misclassified 1 image. Maybe because it didn’t learn the features properly, or maybe there’s some other reason, like an unbalanced dataset. But the thing to note is that the model did misclassify 1 image.

If you don’t know what an unbalanced dataset is, or how an unbalanced dataset can cause such issues, Google it. Also, refer to the references I share at the end.

You can do 99 things for someone and all they’ll remember is the one thing you didn’t do.

Remember the quote? Yes… and we are going to do the same with our model. We are going to look at that 1 misclassified image. Consider the task now. If we misclassify an image as not a dog, how will it impact the users? It won’t, right? Or maybe just a little. Now, suppose the task was to classify whether an image captured by a CCTV camera in a small town contained a lion or not. And if there was a lion, alert all the citizens of the town to be aware and hide. Now, if the model misclassified an image of a lion, it would have a huge impact on the citizens.

Consider an even more serious task: classifying whether a person is infected by COVID-19 or not. If he/she is infected, alert the emergency staff and quarantine him/her. What if that infected person is not quarantined? The virus would spread, right? The impact of a wrong/false classification here is huge. Hence, even if the model is 99% accurate and it only misclassified 1% of the data, we will still tell the model that it made a mistake and ask it to improve.

Hence, we need something more than accuracy. And that metric is called recall. Now, in order to know how recall helps here, we will need to understand what recall is.

Remember.. You haven’t yet understood Precision. I skipped that part :(

Recall

What does recall mean in simple terms? Forget about AI/ML for a moment. What do you mean by “I am trying to recall but I can’t”? Or “let me try to recall what happened”? Does “recall” equal “think”? No… it’s “remember”. Actually, recall and remember have a slight difference in meaning but are mostly the same. In both of the sentences above, you can replace recall with remember and they would still work fine.

So, recall = remember.

The thing here is, our model needs to recall whether the features of a person indicate that he/she is COVID-19 positive. Our model needs to remember the features of the COVID-19 positive class so that it does not misclassify a COVID-19 positive case as negative.

Recall can then be defined as the number of positive samples correctly classified (remembered/recalled) by the model, divided by the total number of positive samples. Suppose there are 50 positive samples in the dataset. Now, on running predictions on this dataset, the model correctly predicts only 20 of them. This means the model is only able to correctly remember 20 positive samples out of 50. And hence, the recall is 40%. (20/50 = 0.4)

Such a model predicting COVID-19 positive cases won’t work, because it is marking 60% of the COVID-19 positive cases as negative. And that number (60%) is too high to ignore.

So, recall = number of positive samples correctly predicted by the model / total number of positive samples.

The number of samples correctly (true) classified as positive equals TP. The total number of positive samples in the dataset equals TP + FN. Because FN means that the model said “negative” and the model is “wrong”; hence, the sample was actually “positive”.

That means, the invented formula is:
recall = TP / (TP + FN)

Hence, “What is the recall of the model?” simply answers the question “How many of the total positive datapoints (images) are correctly remembered by the model?”

Total positive datapoints = TP + FN

Because, TP = Model predicts that the datapoint is positive and the model is correct i.e. datapoint is indeed positive.

And, FN = Model predicts that the datapoint is negative and the model is wrong here i.e. datapoint is positive.

Also, datapoints correctly remembered by the model = TP + TN
That is, positive datapoints correctly remembered by the model = TP

Finally, recall = positive datapoints correctly remembered / total positive datapoints = TP / (TP + FN)

So, remember that recall answers the question — How many of the total positive datapoints did the model correctly remember? Or, How well does the model recall positive datapoints?
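Here’s a minimal sketch of the formula we just “invented”, using the 50-positive-samples example from above:

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of the actual positives that the model 'remembered' (predicted as positive)."""
    return tp / (tp + fn)

# 50 positive samples in the dataset: 20 remembered (TP), 30 missed (FN).
print(recall(tp=20, fn=30))   # 0.4 -> 40% recall
```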

Wait… what about TN and FP? Also, I have written “correctly predicts positive samples” all the time. So what about the other cases? Like “incorrectly predicts positive”, i.e. classifying a person who is not infected with COVID-19 as positive. That is an example of FP: the model said the person is infected but he/she isn’t. Now, does that matter? How much does it hurt to quarantine a person who is not infected? A little, yes? So, we can ignore it. Also, TN can be ignored as the prediction is true (correct).

Why do you need precision?

I said that if a person who isn’t infected with COVID-19 is predicted as infected (positive), then it does not matter. And you blindly believed me!

But but but.. What if you are living in North Korea? You will be shot dead if you are detected positive. “What the hell…. That’s a high impact. You can’t just ignore this. I want to live man!!” Yeah.. I hear these words too. So, that’s the reason you need precision.

There’s another reason too. What if I simply ask the model to classify all the images as positive? In this case, TP = x, FP = 100 − x (if the dataset size is 100 and x samples are actually positive), TN = 0 and FN = 0. Recall in this case would be TP / (TP + FN) = x / x = 1, i.e. 100%.

What the heck!!! This means we would shoot every human in North Korea, because the model classifies all the citizens as COVID-19 positive and we trust it since recall is 100%. Like, seriously!!!

That is one other reason why we need precision.

The things went in this order:
1. Only accuracy won’t work in certain tasks
2. We need recall
3. Only recall won’t work
4. We need precision along with recall

Precision

Ahh… now you know why I skipped precision. But remember, I have also skipped the confusion matrix, as it was too confusing.

At this stage, you should already know that precision will have something to do with FP. If you haven’t guessed this, go re-read the above two sections.

Consider the last example, where the model simply classified all citizens as COVID-19 positive. In this case, though the recall of the model is high (100%), its precision is very low. Hence, as with other topics in Machine Learning, here too there is a trade-off. Just like the bias-variance trade-off, there is a precision-recall trade-off.

After reading this article, I want you to prove mathematically why there’s a trade-off between precision and recall. And yeah… Google a bit too. If you succeed, leave a comment here describing the method you used.

So, we need the model to also take care of “not misclassifying negative samples”, i.e. not marking an uninfected (negative) person as infected (positive).

We can do this by defining precision as the number of correct positive predictions divided by the total number of positive predictions. For example, suppose the number of positive cases in the dataset is 50 and the model predicts 80 cases as positive. Now, out of these 80 predictions, only 20 are correct and the other 60 are incorrect. That means 20 cases are predicted positive and are correct, i.e. TP = 20. And 60 cases are predicted positive but are incorrect, i.e. FP = 60.

As you can see, the model is not at all precise. The model says that 80 cases are positive, out of which only 20 are actually positive. Here, precision = 20/80 = 25%.

We simply formulated precision above. Precision = TP / (TP + FP)

Understanding this in an intuitive way, “How precise is your model?” answers the question “How many datapoints are actually positive out of the total number of predicted positive datapoints?”

So, remember that precision answers the question — How many of the claimed (predicted) positive datapoints are actually positive? Or, How precise is the model in predicting positive datapoints?
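And a minimal sketch tying both together, using the 20-out-of-80 example above and the “classify everyone as positive” model from earlier (with a hypothetical x = 20 actually-positive samples):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of the predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of the actual positives that the model predicted as positive."""
    return tp / (tp + fn)

# The example above: the model claims 80 positives, but only 20 are real.
print(precision(tp=20, fp=60))   # 0.25 -> 25% precise

# The "classify everyone as positive" model: 100 samples, say 20 actually positive.
tp, fp, tn, fn = 20, 80, 0, 0
print(recall(tp, fn))            # 1.0  -> recall looks perfect...
print(precision(tp, fp))         # 0.2  -> ...but precision exposes the cheat
```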

Conclusion

Both the definitions of precision and recall match their meanings in English.

Like, how many positive datapoints (out of the total number of positive datapoints) does the model remember? — Recall

And, how many (of the total predicted positive datapoints) are actually positive? — Precision

If you just understand what these two questions mean, you can then rebuild the formulas whenever you need them. If you don’t understand these questions clearly, try to translate them into your local language (mine is Gujarati) and you will be able to understand them.

Wait wait.. is it going to end? What about the confusion matrix?

The confusion matrix is just used to visualize all these things and help you cram the formulas. I won’t cover it! But yes, I will help you cram the formulas using the confusion matrix here.

Here is an image that will help you cram the denominators of the precision and recall formulas; the numerator of both is TP.

(Image: a confusion matrix, with cancer as the positive class instead of COVID-19.)
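In code, the same layout can be reproduced with scikit-learn (a small sketch, reusing the hypothetical toy labels from earlier):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (hypothetical)

# With labels=[0, 1], scikit-learn lays the matrix out as:
#   [[TN, FP],
#    [FN, TP]]
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
# [[3 1]
#  [1 3]]
# Recall    = TP / (TP + FN) -> denominator is the actual-positive row sum
# Precision = TP / (TP + FP) -> denominator is the predicted-positive column sum
```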

What more? Nothing.. Maybe what I have written is too confusing. Maybe it is not. I don’t know. Just leave your comments, bad or good, so that I can know.

But yeah… something more to do by yourself. Go read about the F1 score and why you need it. Short answer: because of the trade-off between precision and recall. How would you select a model? Based on precision? Or based on recall? The answer is the F1 score. Go read about it…
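A quick hint for that homework: F1 is the harmonic mean of precision and recall, so it is only high when both are high. A minimal sketch (reusing the “everyone is positive” numbers from above):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The "classify everyone as positive" model: recall 1.0, precision 0.2.
print(f1_score(0.2, 1.0))   # ~0.33 -> a single number that exposes the imbalance
```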

The reference I promised is here:
- https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c

What more? Read about the ROC curve, mAP and AR. Or wait for me to post about them… Bye!
