Why is it a good idea to use the F1 score to qualify a Machine Learning model?

Updated: 4 days ago

One of the junior squirrels of the team was teasing Amit, our Chief Data Scientist, today about using F1, a harmonic mean, to qualify our models. Our young squirrel argued that a harmonic mean always emphasizes the minimum of precision and recall, and hence always gives a pessimistic score to the models he was training with iQC for key-value extraction and classification.



Is that right? Is it correct to use F1 to characterize the quality of a model?


Before answering these questions, let’s recall how we qualify an information detection model or a classification model. Most of the time, the method we use to qualify a model is the “hold-out” method or the cross-validation method. As explained by Eijaz Allibhai in this post, cross-validation is just an extension of the hold-out method.


So, since the question is about the value of using a harmonic mean to calculate the F1 score, we will only consider the hold-out method to evaluate the models. In short, in this method the data set of known values (the data set which has been tagged) is separated into two sets: the training set and the benchmark set (also named the test set). The split is typically done with 75-85% of the data for training and 15-25% for the benchmark. The model is trained using the training set; then the model is applied to the documents of the benchmark, and we compare the information detected by the model against the known expected information.
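The hold-out split described above can be sketched in a few lines of Python. This is only an illustration, not the way iQC does it: the function name `hold_out_split`, the 20% benchmark fraction and the fixed random seed are assumptions chosen for the example.

```python
import random

def hold_out_split(documents, test_fraction=0.2, seed=0):
    """Shuffle tagged documents and return (training, benchmark) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(documents)
    # A fixed seed (an assumption) keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled items form the benchmark, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# 100 hypothetical document ids, 20% kept aside for the benchmark.
training, benchmark = hold_out_split(range(100), test_fraction=0.2)
print(len(training), len(benchmark))
```

Shuffling before splitting avoids a benchmark made only of the most recent or the most similar documents.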


For every item of the benchmark, there are 4 possible cases:


· The expected value is correctly detected: it is a True Positive (TP) case

· A value is detected but it is wrong: it is a False Positive (FP) case

· The expected value is not detected: it is a False Negative (FN) case

· No value was to be detected and nothing has been detected: it is a True Negative (TN) case


The definition of these four terms is very well explained on this Wikipedia page: F-score - Wikipedia

For a given benchmark, it is possible to count the number of TP, FP, FN and TN and to compute two scores:


· Precision = TP / (TP + FP)


where TP is the number of True Positive cases and FP the number of False Positive cases. In other words, precision is the fraction of True Positive cases among the detected values.


· Recall = Sensitivity = TP / (TP + FN)


where FN is the number of False Negative cases. In other words, recall is the number of True Positive cases over the number of values to be detected.
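To make the counting concrete, here is a minimal Python sketch that classifies each benchmark item into one of the four cases and derives Precision and Recall from the counts. The function names, the use of `None` for "no value" and the sample data are all hypothetical. Following the cases listed above, a wrong value is counted as a False Positive; counting a value detected where none was expected as a False Positive too is an assumption of this sketch.

```python
from collections import Counter

def outcome(expected, detected):
    """Classify one benchmark item; None means 'no value'."""
    if expected is None:
        # Assumption: a value detected where none was expected counts as FP.
        return "TN" if detected is None else "FP"
    if detected is None:
        return "FN"
    # A detected value that differs from the expected one is a wrong value (FP).
    return "TP" if detected == expected else "FP"

def precision_recall(pairs):
    """Return (counts, precision, recall) for (expected, detected) pairs."""
    counts = Counter(outcome(e, d) for e, d in pairs)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    # Scores are defined as 0 when their denominator is 0 (an assumption).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return counts, precision, recall

# Hypothetical benchmark: (expected value, value detected by the model).
benchmark = [
    ("2021-03-01", "2021-03-01"),  # correct value
    ("ACME", "ACNE"),              # wrong value
    ("42", None),                  # missed value
    (None, None),                  # nothing expected, nothing detected
    (None, "x"),                   # spurious value
]
counts, precision, recall = precision_recall(benchmark)
print(dict(counts), precision, recall)
```

On this toy benchmark there is 1 TP, 2 FP, 1 FN and 1 TN, so Precision is 1/3 and Recall is 1/2.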


At this point, we have 2 scores to qualify one model, and a question immediately arises: is it better to have a high Precision or a high Recall (Sensitivity)?


Obviously, the best is to have both, and the worst is to have a model with poor precision and poor recall, no doubt. But what if one is good and the other is poor? Could we average the two values?

Let’s do it and try two kinds of averaging: the arithmetic mean and the harmonic mean:


· Arithmetic mean = (P + R) / 2


· Harmonic mean = 2 * P * R / (P + R)


where P is the Precision and R the Recall.



Since Precision and Recall are values between 0 and 100%, the two types of mean can be displayed with the same color code, from 0 to 100, for all values of P and R:





We can see on the graphic above that the harmonic mean is high (red) only if both P and R are high, and is low (blue) as soon as one of the two is low. This is not the case with the arithmetic mean, where a mean of 50 (white) is obtained when one of the two is zero and the other is 100.
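The same behavior can be checked numerically on a few (P, R) pairs. The sketch below is only an illustration: the pairs are arbitrary, and defining the harmonic mean as 0 when P and R are both 0 is an assumption made to avoid a division by zero.

```python
def arithmetic_mean(p, r):
    """Arithmetic mean of precision p and recall r (both in percent)."""
    return (p + r) / 2

def harmonic_mean(p, r):
    """Harmonic mean of p and r; defined as 0 when both are 0 (assumption)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Arbitrary (P, R) pairs, from balanced to completely unbalanced.
for p, r in [(100, 100), (50, 50), (10, 90), (0, 100)]:
    a = arithmetic_mean(p, r)
    h = harmonic_mean(p, r)
    print(f"P={p:3d} R={r:3d}  arithmetic={a:5.1f}  harmonic={h:5.1f}")
```

The last three pairs all have an arithmetic mean of 50, while their harmonic means are 50, 18 and 0: the more unbalanced the pair, the lower the harmonic mean.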


Imagine now a COVID test which always answers ‘positive’, and let’s apply this stupid test to 100 people, 5 being infected and 95 being healthy. This is purely theoretical, since we don’t make COVID tests at AgileDD and we have no idea about the quality of these tests. This poor test will have the following metrics:

· TP = 5

· FP = 95

· FN = 0

· Precision = 5 / (5+95) = 5%

· Recall = 5 / ( 5 + 0 ) = 100% (Yes the test is stupid but always detects all the affected people!)


We can now compare the two means we have considered above:


· A = Arithmetic mean = (5 + 100) / 2 = 52.5%

· H = Harmonic mean = 2*5*100/(100+5) = 9.5%


It is clear that the harmonic mean, H, does a better job of showing that this COVID test is poor! This harmonic mean is what data scientists call the F1 score.
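The numbers of this example can be reproduced with the Python standard library, whose `statistics` module provides both `mean` and `harmonic_mean`. The population below is the purely theoretical one used above, and the variable names are chosen for the illustration.

```python
from statistics import harmonic_mean, mean

# Theoretical population from the example: 5 infected, 95 healthy.
truth = [True] * 5 + [False] * 95
# The "stupid" test answers positive for everybody.
predicted = [True] * len(truth)

# Count the cases by comparing the test answer with the real status.
tp = sum(t and p for t, p in zip(truth, predicted))
fp = sum((not t) and p for t, p in zip(truth, predicted))
fn = sum(t and (not p) for t, p in zip(truth, predicted))

precision = 100 * tp / (tp + fp)  # in percent
recall = 100 * tp / (tp + fn)     # in percent

a = mean([precision, recall])
h = harmonic_mean([precision, recall])
print(tp, fp, fn)
print(f"A = {a:.1f}%  H = {h:.1f}%")
```

This prints the counts 5, 95 and 0, then A = 52.5% and H = 9.5%, the same values as in the manual calculation.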


Therefore, it can be said that our CTO Amit and the whole data science community have good reasons to use the harmonic mean to calculate the F1 score, even if it seems pessimistic to some junior squirrels of the team…


OK, but now another criticism may arise: the F1 score gives equal importance to Recall and to Precision, and in some cases I may be more interested in one of the two. Indeed, we have some customers who use iQC to populate a corporate database and prefer a high precision to a high recall, and other customers who are more interested in extracting as much data as possible in order to point out outliers. Could we use the same metric to qualify a model for these two kinds of customers?


In fact, the F1 score is just a specific case (β = 1) of the Fβ score, defined as:

Fβ = (1 + β²) * P * R / ((β² * P) + R)


If the emphasis needs to be on Precision, as in the case of populating a corporate database, it is possible to replace the F1 score by an F½ score. In the case of the poor COVID test seen above, the F½ value will be:


F½ = (1 + 1/4) * 5 * 100 / (((1/4) * 5) + 100) = 625 / 101.25 = 6.17%


F½ is more sensitive to Precision.


When it is necessary to detect more values from a data set and some false positive values can be tolerated, it may be preferable to replace F1 by F2. For the bad COVID test seen above:


F2 = (1 + 4) * 5 * 100 / ((4 * 5) + 100) = 2500 / 120 = 20.83%


F2 is more sensitive to Recall.
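The three scores can be compared with a small Python function implementing the general Fβ definition, Fβ = (1 + β²) * P * R / ((β² * P) + R). The function name `f_beta` is chosen for this sketch, and returning 0 when the denominator is 0 is an assumption.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score of precision and recall (same unit, e.g. percent)."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    # Defined as 0 when precision and recall are both 0 (assumption).
    return (1 + b2) * precision * recall / denominator if denominator else 0.0

# The poor COVID test: Precision = 5%, Recall = 100%.
for beta in (0.5, 1, 2):
    print(f"F{beta} = {f_beta(5, 100, beta):.2f}%")
```

For the poor COVID test this gives 6.17% for β = ½, 9.52% for β = 1 and 20.83% for β = 2: the smaller β is, the more the low Precision is penalized.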


A few words of history


As is often the case with AI and data science, a lot was defined before AI became popular. The F-score is believed to have been first defined by the Dutch professor of computer science Cornelis Joost van Rijsbergen, who is viewed as one of the founding fathers of the field of information retrieval. In his 1975 book ‘Information Retrieval’ he defined a function very similar to the F-score, recognizing the inadequacy of accuracy as a metric for information retrieval systems.


The reason F1 is better than accuracy will be the subject of the next post.




