“If 1 in 100 disease-free samples are wrongly coming up positive, even though the disease is not present, we call that a 1% false positive rate.
…
Because of the high false positive rate and the low prevalence, almost every positive test, a so-called case, identified by Pillar 2 since May of this year has been a FALSE POSITIVE. Not just a few percent. Not a quarter or even a half of the positives are FALSE, but around 90% of them. Put simply, the number of people Mr Hancock sombrely tells us about is an overestimate by a factor of about ten-fold. Earlier in the summer, it was an overestimate by about 20-fold.” — Dr Michael Yeadon, https://lockdownsceptics.org/lies-damned-lies-and-health-statistics-the-deadly-danger-of-false-positives/
Foreword
This is a write-up on false positive, false negative, true positive, true negative thinking. It is nice to now have the words “specificity” (the true negative rate) and “sensitivity” (the true positive rate), which I lacked before.
I realize this fits within the framework of signal detection theory. When I studied signal detection, I had no idea why it was considered important, particularly in psychology, but over the years I have often gone back to it, since it comes up routinely in a number of guises. Here it has to be statistical signal detection, an obvious extension of binary, deterministic signal detection. This also fits within the framework of fuzzy logic, which I have looked at in years past.
Medical testing makes use of the ideas of signal detection and, upon reflection, is no different from the consideration of any other sort of evidence – it has to be obtained, assessed for quality, and interpreted to understand the implications.
Anybody who tells you that such-and-such a test is some single percentage effective is misleading you, unintentionally or perhaps even deliberately.
There are four quadrants in the “gold standard” detection matrix: true positive, false positive, true negative, and false negative. There are at least two percentages to be considered, and both vary with the bias in your test – the detection threshold decided upon. You can bias the test so that you find all the cases, with the consequence that you get a lot of false positives (false alarms). You can bias it so that you get few false positives, and as a result get a lot of false negatives (miss the fire). In addition, the baseline must be considered, in the sense of a Bayesian statistical analysis, so the prior information on general infection rates will change the percentages. A low base rate of infection gives a high number of false positives; a high base rate of infection gives a low number of false positives.
Bayesian reasoning: start with prior probabilities (assessed somehow) and see how probabilities change with new evidence.
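A minimal sketch of one such update, in Python; the 2% prior, 95% sensitivity, and 10% false positive rate are the same illustrative figures used in the worked example later in this piece, not values for any real test.

```python
# A minimal sketch of a Bayesian update for a positive test result.
# The numbers are illustrative, not from any particular test.

def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(infected | positive test), by Bayes' theorem."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# With a 2% prior, a positive result from a 95%-sensitive test with a
# 10% false positive rate only raises the probability to about 16%.
print(posterior_given_positive(prior=0.02, sensitivity=0.95,
                               false_positive_rate=0.10))  # -> 0.162...
```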
If you do not have much noise masking the signal, results are easier to interpret. If you have large numbers and effects that are strong with respect to variability, the statistics should bear you out. Potentially confounding factors can be accounted for.
Introduction
It is routine for medical tests to be used to determine the health status of people. The simple view of a test is that it returns a true or false result, using some measure, some test instrument, and some testing protocol. Of course, anyone reflecting on the issue even a little will realize that this is a much over-simplified view of things. For one thing, what threshold, what cut-off point, is being used to make the decision of true or false? What is the measure being used, and what is the measuring instrument? Most things we measure involve some sort of continuous scale. Are the measurements continuous, or simply yes or no? What are the typical values used to make the judgement? What are the health implications? How are the numbers to be interpreted when the test is used as a screening device or as a diagnostic tool? All of these considerations are important for understanding the test.
In this discussion I draw on ratios and proportions, odds, signal detection theory, statistics including Bayesian statistics, and simple arithmetic. I use these tools to examine the accuracy of testing.
Key Points to be Explained
Medical tests are not perfect; they give erroneous results along with correct results. We can estimate the accuracy of a test using scientific investigation. We can estimate how likely the test is to find that some condition is present (a hit). We can estimate how likely the test is to find that the condition is absent (a correct rejection). We can bias the test by changing the threshold, the cut-off value: we can increase hits and false alarms together, or reduce both together. In addition, a low prevalence of the condition will give a lot of false alarms, and a high prevalence of the condition will give a lot of misses. Also, a highly specific test will reduce the number of false alarms, whereas a highly sensitive test will reduce the number of misses.
The Perfect Test
Here is a diagram which shows testing with no allowance for error.
| Is the Effect Observed? | Does the Condition Exist? |
|---|---|
| Effect Observed | Condition exists |
| Effect Not Observed | Condition is absent |

True or false – assuming no errors
The above chart shows:
- Is the effect observed using the test?
- Does the condition exist in the person tested?
As a result, we have two cases for the test result:
- The effect is observed, so the condition is deemed to exist
- The effect is not observed, so the condition is deemed not to exist
The Imperfect, Real-world Test
With a bit of thought, the question of errors for the test will come up. Is the test perfect? This would seem highly unlikely.
In testing, there are two ways for the results to be true, and two ways for the results to be false.
| Was the Effect Observed? | Condition Exists | Condition Is Absent |
|---|---|---|
| Effect Observed | Condition is correctly considered to exist (HIT / TRUE POSITIVE) | Condition is falsely considered to exist (FALSE ALARM / FALSE POSITIVE) |
| Effect Not Observed | Condition is falsely considered to be absent (MISS / FALSE NEGATIVE) | Condition is correctly considered to be absent (CORRECT REJECTION / TRUE NEGATIVE) |

True or false – assuming errors
- Observing an effect when the effect exists is a Hit
- Not observing an effect when the effect exists is a Miss
- Observing an effect when no effect exists is a False Alarm
- Not observing an effect when no effect exists is a Correct Rejection
Synonyms for these terms are:
- Hit – True Positive (TP), Sensitivity
- False Alarm – False Positive (FP), Type I Error
- Miss – False Negative (FN), Type II Error
- Correct Rejection – True Negative (TN), Specificity
I will use the abbreviations TP, FP, FN, TN in most of the discussion, although the meanings are probably not as easily grasped.
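For concreteness, here is a minimal sketch of the four-way classification in Python; the function name is mine, purely for illustration.

```python
# A small sketch mapping (effect observed, condition exists) pairs onto
# the four cells of the matrix, using the abbreviations from the text.

def classify(effect_observed: bool, condition_exists: bool) -> str:
    """Map one test result onto a cell of the detection matrix."""
    if effect_observed and condition_exists:
        return "TP"  # hit
    if effect_observed and not condition_exists:
        return "FP"  # false alarm
    if not effect_observed and condition_exists:
        return "FN"  # miss
    return "TN"      # correct rejection

for observed in (True, False):
    for exists in (True, False):
        print(f"observed={observed!s:5}  exists={exists!s:5}"
              f"  ->  {classify(observed, exists)}")
```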
Proportions
The above matrix may be used to show more than one thing. It can be used to show the proportions – the expected percentages, the odds – for each cell of the matrix in some testing scenario. It may also be used to show the expected counts for each cell, given an overall count for the number of tests.
In assigning proportions to these categories, the ratios can be expressed as fractions, decimal fractions, or percentages. We have the following proportions of interest:
Overall Estimates of True Percentages:
- percent of people who are actually infected
- percent of people who are truly not infected
We will make this a binary split, not allowing for degrees of infection. Degrees of infection do matter, but not for this discussion.
Since you don’t know the percentage of infections, you must make an estimate. How this should be done is problematic in many cases. There may be little data, and the data may be suspect.
How we arrive at these estimated percentages is complex: scientific, statistical, and not without error. It should be done independently of the test being evaluated. We call these percentages prior odds, priors, or baseline values.
Test Performance
We also need to look at the performance of a given test for classifying the results.
- Of those people who are actually infected, what percentages test as true positive (TP)?
- We can get the false negative (FN) percentage, the misses, by subtracting the true positive percentage from 100 percent; this is the arithmetic complement. Conversely, if you know the percentage of false negatives, you can take its complement to get the percentage of true positives.
- Of those people who are truly not infected, what percentage test as true negative (TN)?
- We can get the false positive (FP) percentage, the false alarms, by subtracting the true negative percentage from 100 percent; this is the arithmetic complement. Conversely, if you know the percentage of false positives, you can take its complement to get the percentage of true negatives.
At first glance, you might think that you can apply these percentages against the whole matrix, assuming that the matrix represents 100%, and each of the four cells has some fraction, all adding up to 100%. Things don’t work that way.
The test performance percentages for separating false positives from true negatives only apply to those who are uninfected. Remember, the information on overall infection rates is obtained in some other manner – other studies, some wild-assed guess, or a deity telling you.
On the other side of the matrix, the test performance, the percentages, for separating false negatives from true positives only applies to those who are infected. Remember the sources of this information laid out above.
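In code, the two complement relationships are one-liners; a small sketch using the same illustrative rates that appear later in this piece. Note that the two pairs of rates are defined over different sub-populations.

```python
# The complement relationships from the bullets above, stated as code.
# Sensitivity and the miss rate are measured on the infected;
# specificity and the false alarm rate are measured on the uninfected.

sensitivity = 0.95                    # TP rate, measured on the infected
miss_rate = 1.0 - sensitivity         # FN rate = 0.05

specificity = 0.90                    # TN rate, measured on the uninfected
false_alarm_rate = 1.0 - specificity  # FP rate, approximately 0.10
```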
Here is a diagram adapted from a very good tutorial on this topic. See “Confused by The Confusion Matrix: What’s the difference between Hit Rate, True Positive Rate, Sensitivity, Recall and Statistical Power?” by The Curious Learner, https://learncuriously.wordpress.com/2018/10/21/confused-by-the-confusion-matrix/
Probabilities Based on Whether or Not the Effect Exists
| Was the Effect Observed? | Effect Exists | Effect Doesn't Exist |
|---|---|---|
| Effect Observed | Hit Rate; True Positive Rate; Sensitivity; Statistical Power; (1 – Beta) | False Alarm Rate; False Positive Rate; Statistical Significance; Type I Error Rate (Alpha) |
| Effect Not Observed | Miss Rate; False Negative Rate; Type II Error Rate (Beta) | Correct Rejection Rate; True Negative Rate |
Testing as Evidence
Tests provide evidence. Evidence must be:
- found or produced somehow.
- assessed for reliability, quality, internal validity.
- interpreted, examined for external validity, the implications made clear.
Tests can be given scores based on sensitivity (the true positive rate) and specificity (the true negative rate). As shown above, true positives and false negatives are complements of one another, and true negatives and false positives are likewise complements of one another.
Incorporating the Priors
Testing must take the priors into account when the calculations are done. It makes no sense to apply the percentages for false positives and true negatives to the category of estimated infected. Likewise, it makes no sense to apply the percentages for true positives and false negatives to the category of estimated uninfected. The false positive and true negative test percentages are based on the uninfected; the true positive and false negative test percentages are based on the infected.
Incorporating the Case Counts
We can work with percentages, but for analysis we really want to see actual counts. We make use of the overall number of independent tests, the priors, and the test performance to make a two-by-two matrix of estimated counts.
Threshold
You can set a threshold for a test score, setting the bias point. If you set the threshold to give more hits, you will get more false alarms but miss less often. If you set the threshold to give fewer hits, you will get fewer false alarms but miss more often.
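Here is a hedged sketch of that trade-off; the two overlapping normal score distributions are invented for illustration, standing in for uninfected and infected test scores.

```python
# A sketch of how moving the decision threshold trades hits against
# false alarms. The score distributions are made up for illustration.
import random

random.seed(1)
uninfected = [random.gauss(0.0, 1.0) for _ in range(10_000)]
infected   = [random.gauss(1.5, 1.0) for _ in range(10_000)]

for threshold in [-0.5, 0.0, 0.5, 1.0, 1.5]:
    hit_rate = sum(s > threshold for s in infected) / len(infected)
    fa_rate  = sum(s > threshold for s in uninfected) / len(uninfected)
    print(f"threshold={threshold:+.1f}  hits={hit_rate:.2f}  "
          f"false alarms={fa_rate:.2f}")

# Lowering the threshold raises both rates together; raising it lowers
# both together - exactly the bias described above.
```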
False Versus True and Priors
With a low prior rate of infection, the number of false positives can be much greater than the number of true positives, even with an accurate test.
Test Performance and Receiver Operating Characteristic (ROC)
Estimated or actual values for a given test can be plotted, putting the false positive fraction (X axis) against the true positive fraction (Y axis), to give a curve. This plot is called the receiver operating characteristic (ROC) curve. Any point along the curve can be selected as a cut-off point, a threshold. If the threshold is set to detect more cases, you also get more false positives. If the threshold is set to exclude more cases, you also get fewer false positives.
The area under the ROC curve also gives a measure of accuracy: the greater the area, the more accurate the test. Since both axes go from 0 to 1, the maximum area is 1. A diagonal ROC curve corresponds to performance at chance levels.
Not all tests are equal. Some have much better accuracy overall. The more bowed the ROC curve is above the diagonal, the better the test.
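The area can be estimated numerically with the trapezoidal rule; a minimal sketch, with made-up points standing in for a real curve such as the one tabulated below.

```python
# Trapezoidal estimate of the area under an ROC curve. A perfect test
# gives 1.0; the diagonal (chance performance) gives 0.5.

def roc_auc(points):
    """points: (FPF, TPF) pairs, sorted by FPF, from (0, 0) to (1, 1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between points
    return area

chance = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
print(roc_auc(chance))  # 0.5, chance-level performance

bowed = [(0.0, 0.0), (0.1, 0.7), (0.3, 0.9), (1.0, 1.0)]
print(roc_auc(bowed))   # 0.86 for this made-up, bowed curve
```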
Example Data for ROC Curve
See http://www.rad.jhmi.edu/jeng/javarad/roc/JROCFITi.html
| False Positive Fraction (FPF) | True Positive Fraction (TPF) | Lower (95% CI) | Upper (95% CI) |
|---|---|---|---|
| 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.0050 | 0.2301 | 0.0169 | 0.7407 |
| 0.0100 | 0.3135 | 0.0430 | 0.7718 |
| 0.0200 | 0.4168 | 0.0996 | 0.8061 |
| 0.0300 | 0.4860 | 0.1545 | 0.8282 |
| 0.0400 | 0.5384 | 0.2056 | 0.8449 |
| 0.0500 | 0.5807 | 0.2523 | 0.8587 |
| 0.0600 | 0.6159 | 0.2949 | 0.8705 |
| 0.0700 | 0.6461 | 0.3337 | 0.8808 |
| 0.0800 | 0.6723 | 0.3690 | 0.8901 |
| 0.0900 | 0.6955 | 0.4012 | 0.8985 |
| 0.1000 | 0.7161 | 0.4306 | 0.9062 |
| 0.1100 | 0.7347 | 0.4575 | 0.9132 |
| 0.1200 | 0.7515 | 0.4821 | 0.9198 |
| 0.1300 | 0.7668 | 0.5047 | 0.9258 |
| 0.1400 | 0.7809 | 0.5255 | 0.9314 |
| 0.1500 | 0.7938 | 0.5447 | 0.9366 |
| 0.2000 | 0.8454 | 0.6214 | 0.9577 |
| 0.2500 | 0.8822 | 0.6757 | 0.9723 |
| 0.3000 | 0.9096 | 0.7160 | 0.9824 |
| 0.4000 | 0.9466 | 0.7727 | 0.9934 |
| 0.5000 | 0.9691 | 0.8119 | 0.9978 |
| 0.6000 | 0.9832 | 0.8424 | 0.9994 |
| 0.7000 | 0.9918 | 0.8684 | 0.9999 |
| 0.8000 | 0.9967 | 0.8927 | 1.0000 |
| 0.9000 | 0.9992 | 0.9189 | 1.0000 |
| 0.9500 | 0.9998 | 0.9357 | 1.0000 |
| 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Summary ROC Statistics
Number of Cases: 50
Number Correct: 42
Accuracy: 84%
Sensitivity: 88%
Specificity: 80%
Positive Cases Missed: 3
Negative Cases Missed: 5
(A rating of 3 or greater is considered positive.)
Fitted ROC Area: 0.905
Empiric ROC Area: 0.892
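As a check on these figures, the counts can be reconstructed from the summary: 3 missed positives at 88% sensitivity implies 22 + 3 = 25 positive cases, and 5 missed negatives at 80% specificity implies 20 + 5 = 25 negative cases. The 25/25 split is an inference from the reported numbers, not stated in the source.

```python
# Verifying the summary ROC statistics from the implied counts
# (25 positive and 25 negative cases, inferred as described above).

tp, fn = 22, 3   # 25 positive cases, 3 missed
tn, fp = 20, 5   # 25 negative cases, 5 missed

print(f"accuracy    = {(tp + tn) / (tp + tn + fp + fn):.0%}")  # 84%
print(f"sensitivity = {tp / (tp + fn):.0%}")                   # 88%
print(f"specificity = {tn / (tn + fp):.0%}")                   # 80%
```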
Plotting the ROC Curve

ROC Curve Type: Fitted
Key for the ROC Plot
RED symbols and BLUE line: Fitted ROC curve.
GRAY lines: 95% confidence interval of the fitted ROC curve.
BLACK symbols ± GREEN line: Points making up the empirical ROC curve (does not apply to Format 5).
Testing Overall
The test can be viewed as the measure plus the measuring method. It can also include the procedures, the protocol for conducting the test. Differing protocols can change and confound the test results. Tests can be very accurate and still give a large number of false positives when the estimate of infection rates is low.
Testing Interpretation
The test results require interpretation by a skilled clinician. Sometimes, tests are used for screening, and sometimes for actual diagnosis. One test alone should not be relied upon. Tests should be repeated.
Testing Signal versus Noise
Test results for the same individual can vary because of “noise” masking the “signal.” By noise we mean fluctuations in the measurement of interest that arise from factors other than the condition of interest, perhaps random factors.
Testing and Time Variance
Test results can vary for the same individual because the underlying conditions can change from one time to the next. Levels of any condition can fluctuate over time: hourly, daily, weekly, and so on. With health conditions: you get infected, you get sick, you get better, you die, and so on.
The Calculations of the Estimates
False Positive Calculations
The simple calculation of the expected false positive count uses three numbers:
- Prior baseline estimates for not infected
- Test accuracy estimates for false positive
- Number of tests
Multiply them together to get the expected count of false positives. False positives are only evaluated against the uninfected cases, not all test cases.
False Negative Calculations
The expected false negative count comes from:
- Prior baseline estimates for infected
- Test accuracy estimates for false negatives
- Number of tests
Multiply them together to get the expected count of false negatives. False negatives are only evaluated against the infected cases, not all test cases.
True Positive Calculations
The expected true positive count comes from:
- Prior baseline estimates for infected
- Test accuracy estimates for true positives
- Number of tests
Multiply them together to get the expected count of true positives. True positives are only evaluated against the infected cases, not all test cases.
True Negative Calculations
The expected true negative count comes from:
- Prior baseline estimates for not infected
- Test accuracy estimates for true negatives
- Number of tests
Multiply them together to get the expected count of true negatives. True negatives are only evaluated against the uninfected cases, not all test cases.
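The four multiplications just described fit in a few lines; here is a minimal sketch that reproduces the counts of the worked example in the next section (2% prior, 95% sensitivity, 90% specificity, 1,000 tests).

```python
# Expected counts for all four cells of the matrix. Each rate is
# applied only to its own sub-population: sensitivity and miss rate
# to the infected, specificity and false alarm rate to the uninfected.

def expected_counts(prior, sensitivity, specificity, n_tests):
    infected = prior * n_tests
    uninfected = (1 - prior) * n_tests
    return {
        "TP": infected * sensitivity,          # hits
        "FN": infected * (1 - sensitivity),    # misses
        "FP": uninfected * (1 - specificity),  # false alarms
        "TN": uninfected * specificity,        # correct rejections
    }

print(expected_counts(prior=0.02, sensitivity=0.95,
                      specificity=0.90, n_tests=1000))
# -> {'TP': 19.0, 'FN': 1.0, 'FP': 98.0, 'TN': 882.0}
```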
An Example
In the example below, I set the following parameters:
| Population Baseline Estimates | Value |
|---|---|
| Number of Tests | 1,000 |
| Prior Baseline Infection Rate Estimate | 2% |
| Baseline True Positives = Prior Baseline X Number of Tests | 20 |
| Baseline True Negatives = (1 – Prior Baseline) X Number of Tests | 980 |

| Testing Method Performance | Value |
|---|---|
| Hit Rate (Sensitivity) | 95% |
| Miss Rate = One's Complement of Hit Rate | 5% |
| False Alarm Rate | 10% |
| Correct Rejection Rate (Specificity) = One's Complement of False Alarm Rate | 90% |
Using these parameters, I calculate expected counts:
| Expected Counts | Value |
|---|---|
| True Positives (TP) = Baseline True Positives X Hit Rate | 19 |
| False Positives (FP) = Baseline True Negatives X False Alarm Rate | 98 |
| False Negatives (FN) = Baseline True Positives X Miss Rate | 1 |
| True Negatives (TN) = Baseline True Negatives X Correct Rejection Rate | 882 |
Summaries
I summarize the calculated values in the matrix below. You can see that the number of false positives, under these assumptions, is about five times the number of true positives, i.e., very high. Also, the number of false negatives is very low for this test and this prior infection rate. This is with a test specificity of 90%, a test sensitivity of 95%, and an estimated infection rate of 2%.
| Was the Effect Observed? | Condition Exists | Condition Is Absent |
|---|---|---|
| Effect Observed | TP = 19 | FP = 98 |
| Effect Not Observed | FN = 1 | TN = 882 |

Estimated counts, based on test performance, priors, and number of tests
Measures of Test Performance
Below are various measures of test performance, of test quality. They use the data from the previous example. The calculations presented here are simple; the interpretation takes more skill.
Core Set of Measures
| Measures of Test Performance | Value |
|---|---|
| Diagnostic Accuracy = (TP + TN) / (TP + TN + FP + FN) | 0.90 |
| Sensitivity = TP / (TP + FN) | 0.95 |
| Specificity = TN / (TN + FP) | 0.90 |

| Predictive Values | Value |
|---|---|
| Positive Predictive Value (PPV) = TP / (TP + FP) | 0.16 |
| Negative Predictive Value (NPV) = TN / (TN + FN) | 1.00 |
The Positive Predictive Value (PPV) and the Negative Predictive Value (NPV) give probabilities based on whether or not the effect was observed. This contrasts with the sensitivity and specificity, which give probabilities based on whether or not the effect actually exists.
Probabilities Based on Whether or Not the Effect was Observed
| Was the Effect Observed? | Effect Exists | Effect Doesn't Exist |
|---|---|---|
| Effect Observed | True Discovery Rate; Positive Predictive Value; Precision | False Discovery Rate |
| Effect Not Observed | False Omission Rate | True Omission Rate; Negative Predictive Value |
Adapted from “Confused by The Confusion Matrix: What’s the difference between Hit Rate, True Positive Rate, Sensitivity, Recall and Statistical Power?” by The Curious Learner, https://learncuriously.wordpress.com/2018/10/21/confused-by-the-confusion-matrix/
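Computed from the example counts, the measures above take only a few lines; a minimal sketch, assuming the TP/FP/FN/TN values from the worked example.

```python
# Measures of test performance, computed from the example counts.

tp, fp, fn, tn = 19, 98, 1, 882

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.90
sensitivity = tp / (tp + fn)                # 0.95
specificity = tn / (tn + fp)                # 0.90
ppv = tp / (tp + fp)                        # 0.16, true discovery rate
npv = tn / (tn + fn)                        # ~1.00

print(f"accuracy={accuracy:.2f} sens={sensitivity:.2f} "
      f"spec={specificity:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```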
Test Estimates with Differing Priors
Here are test estimates based upon five differing population baseline estimates, that is, differing estimates of priors. I vary the priors from 0.1 percent to 99.9 percent.
N.B. In order to avoid division by 0, I did not use 0.0 percent and 100 percent.
The testing method is held constant across all five scenarios; only the prior baseline changes.

| Testing Method | Value |
|---|---|
| Number of Tests | 1,000 |
| Hit Rate (Sensitivity) | 95% |
| Miss Rate = One's Complement of Hit Rate | 5% |
| False Alarm Rate | 10% |
| Correct Rejection Rate (Specificity) = One's Complement of False Alarm Rate | 90% |

| Population Baseline Estimates | | | | | |
|---|---|---|---|---|---|
| Prior Baseline | 0.10% | 2% | 50% | 98% | 99.90% |
| Baseline True Positives = Prior Baseline X Number of Tests | 1 | 20 | 500 | 980 | 999 |
| Baseline True Negatives = (1 – Prior Baseline) X Number of Tests | 999 | 980 | 500 | 20 | 1 |

| Expected Counts | | | | | |
|---|---|---|---|---|---|
| True Positives (TP) = Baseline True Positives X Hit Rate | 0.95 | 19 | 475 | 931 | 949.05 |
| False Positives (FP) = Baseline True Negatives X False Alarm Rate | 99.9 | 98 | 50 | 2 | 0.1 |
| False Negatives (FN) = Baseline True Positives X Miss Rate | 0.05 | 1 | 25 | 49 | 49.95 |
| True Negatives (TN) = Baseline True Negatives X Correct Rejection Rate | 899.1 | 882 | 450 | 18 | 0.90 |

| Quality Measures | | | | | |
|---|---|---|---|---|---|
| Diagnostic Accuracy = (TP + TN) / (TP + TN + FP + FN) | 0.90 | 0.90 | 0.93 | 0.95 | 0.95 |
| Sensitivity = TP / (TP + FN) | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
| Specificity = TN / (TN + FP) | 0.90 | 0.90 | 0.90 | 0.90 | 0.90 |
| Positive Predictive Value (PPV) = TP / (TP + FP) | 0.01 | 0.16 | 0.90 | 1.00 | 1.00 |
| Negative Predictive Value (NPV) = TN / (TN + FN) | 1.00 | 1.00 | 0.95 | 0.27 | 0.02 |
| Positive Predictive Likelihood Ratio = Sensitivity / (1 – Specificity) | 9.50 | 9.50 | 9.50 | 9.50 | 9.50 |
| Negative Predictive Likelihood Ratio = (1 – Sensitivity) / Specificity | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
| Youden's Index = (Sensitivity + Specificity) – 1 | 0.85 | 0.85 | 0.85 | 0.85 | 0.85 |
| Diagnostic Odds Ratio (DOR) = (TP / FN) / (FP / TN) | 171.00 | 171.00 | 171.00 | 171.00 | 171.00 |
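The pattern in the PPV and NPV rows can be reproduced with a short loop; a minimal sketch, using the same five priors and the same 95%/90% test as the table above.

```python
# Recomputing PPV and NPV for the five priors in the table, holding the
# test fixed at 95% sensitivity and 90% specificity. This works with
# rates; multiplying each by 1,000 tests would give the counts.

sens, spec = 0.95, 0.90

for prior in [0.001, 0.02, 0.50, 0.98, 0.999]:
    tp = prior * sens              # true positive fraction of all tests
    fp = (1 - prior) * (1 - spec)  # false positive fraction
    fn = prior * (1 - sens)        # false negative fraction
    tn = (1 - prior) * spec        # true negative fraction
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    print(f"prior={prior:7.2%}  PPV={ppv:.2f}  NPV={npv:.2f}")

# PPV climbs from 0.01 to 1.00 as the prior rises, while NPV falls
# from 1.00 to 0.02 - the same pattern as the table.
```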
Bibliography
- Baratloo, Alireza, Mostafa Hosseini, Ahmed Negida, and Gehad El Ashal. “Part 1: Simple Definition and Calculation of Accuracy, Sensitivity and Specificity.” Emergency 3, no. 2 (2015): 48–49.
- Harvey, Lew. “Detection Theory.” Psychology of Perception, 2014, 17.
- learncuriously. “Confused by The Confusion Matrix Part 2: ‘Accuracy’ Is But One of Many Measures of Accuracy….” The Curious Learner (blog), October 27, 2018. https://learncuriously.wordpress.com/2018/10/28/confused-by-the-confusion-matrix-part-2/.
- ———. “Confused by The Confusion Matrix: What’s the Difference between Hit Rate, True Positive Rate, Sensitivity, Recall and Statistical Power?” The Curious Learner (blog), October 20, 2018. https://learncuriously.wordpress.com/2018/10/21/confused-by-the-confusion-matrix/.
- Lockdown Sceptics. “Lies, Damned Lies and Health Statistics – the Deadly Danger of False Positives.” Accessed October 6, 2020. https://lockdownsceptics.org/lies-damned-lies-and-health-statistics-the-deadly-danger-of-false-positives/.
- National Research Council. “Intelligence Analysis: Behavioral and Social Scientific Foundations.” Accessed September 28, 2020. https://doi.org/10.17226/13062.
- “ROC Analysis: Web-Based Calculator for ROC Curves.” Accessed October 11, 2020. http://www.rad.jhmi.edu/jeng/javarad/roc/JROCFITi.html.
- “ROC Curves – What Are They and How Are They Used?” Accessed October 4, 2020. https://acutecaretesting.org/en/articles/roc-curves-what-are-they-and-how-are-they-used.
- Šimundić, Ana-Maria. “Measures of Diagnostic Accuracy: Basic Definitions.” EJIFCC 19, no. 4 (January 20, 2009): 203–11.