True and False – Key Aspects of Medical Tests

“If 1 in 100 disease-free samples wrongly come up positive (the test says the disease is present when it is not), we call that a 1% false positive rate.

Because of the high false positive rate and the low prevalence, almost every positive test, a so-called case, identified by Pillar 2 since May of this year has been a FALSE POSITIVE. Not just a few percent. Not a quarter or even a half of the positives are FALSE, but around 90% of them. Put simply, the number of people Mr Hancock sombrely tells us about is an overestimate by a factor of about ten-fold. Earlier in the summer, it was an overestimate by about 20-fold.” — Dr Michael Yeadon, https://lockdownsceptics.org/lies-damned-lies-and-health-statistics-the-deadly-danger-of-false-positives/

Foreword

This is a write-up on false positive, false negative, true positive, and true negative thinking. It is nice to now have those words, “specificity” (true negative) and “sensitivity” (true positive), which I lacked before.

I realize this fits within the framework of signal detection theory. When I studied signal detection, I had no idea why it was considered important, particularly in psychology, but over the years I have often gone back to it, since it comes up routinely in a number of guises. Here it has to be statistical signal detection, an obvious extension of binary, deterministic signal detection. This also fits within the framework of fuzzy logic, which I have looked at in years past.

Medical testing makes use of the ideas of signal detection and, upon reflection, is no different from the consideration of any other sort of evidence: it has to be obtained, assessed for quality, and interpreted to understand the implications.

Anybody who tells you that such-and-such a test is some single percentage effective is misleading you, unintentionally or perhaps even deliberately; as we will see, no single number captures a test’s performance.

There are four quadrants in the “gold standard” detection matrix: true positive, false positive, true negative, and false negative. There are at least two percentages to be considered, and they vary together according to the bias in your test, the detection threshold decided upon. You can bias the test so that you find all the cases, with the consequence that you get a lot of false positives (false alarms). You can bias it so that you get few false positives, and as a result get a lot of false negatives (miss the fire). In addition, the baseline must be considered, in the sense of a Bayesian statistical analysis: the prior information on general infection rates changes the expected counts. A low base rate of infection gives a high number of false positives relative to true positives; a high base rate of infection gives a low number of false positives relative to true positives.

Bayesian reasoning: start with prior probabilities (assessed somehow) and see how probabilities change with new evidence.

If you do not have much noise masking the signal, results are easier to interpret. If you have large numbers and effects that are strong relative to their variability, the statistics should bear you out, and potentially confounding factors can be accounted for.

Introduction

It is routine for medical tests to be used to determine the health status of people. The simple view of a test is that it returns a true or false result, using some measure, some test instrument, and some testing protocol. Of course, anyone reflecting on the issue even a little will realize that this is a much-oversimplified view of things. For one thing, what threshold, what cut-off point, is being used to make the decision of true or false? What is the measure being used, and what is the measuring instrument? Most things we measure involve some sort of continuous scale. Are the measurements continuous, or simply yes or no? What are the typical values used to make the judgement? What are the health implications? How are the numbers to be interpreted when the test is used as a screening device or as a diagnostic tool? All of these considerations are important for understanding the test.

In this discussion I draw on ratios and proportions, odds, signal detection theory, statistics including Bayesian statistics, and simple arithmetic. I use these tools to examine the accuracy of testing.

Key Points to be Explained

Medical tests are not perfect; they give erroneous results along with correct results. We can estimate the accuracy of a test using scientific investigation. We can estimate how likely the test is to find a condition that is present (a hit), and how likely it is to find that a condition is absent (a correct rejection). We can bias the test by changing the threshold, the cut-off value: we can increase hits and false alarms together, or reduce both together. In addition, a low prevalence of the condition will give a lot of false alarms, and a high prevalence of the condition will give a lot of misses. Also, a highly specific (selective) test will reduce the number of false alarms, whereas a highly sensitive test will reduce the number of misses.

The Perfect Test

Here is a diagram which shows testing with no allowance for error.

                         Does the Condition Exist?
Is the Effect Observed?
  Effect Observed      →  Condition exists
  Effect Not Observed  →  Condition is absent

True or false – assuming no errors

The above chart shows:

  1. Is the effect observed using the test?
  2. Does the condition exist in the person tested?

As a result, we have two cases for the test result:

  1. The effect is observed, so the condition is deemed to exist
  2. The effect is not observed, so the condition is deemed not to exist

The Imperfect, Real-world Test

With a bit of thought, the question of errors in the test will come up. Is the test perfect? That seems highly unlikely.

In testing, there are two ways for the results to be true, and two ways for the results to be false.

                          Does The Condition Exist?

                          Condition Exists                 Condition Is Absent
Was the Effect Observed?
  Effect Observed         Condition is correctly           Condition is falsely
                          considered to exist              considered to exist
                          (HIT / TRUE POSITIVE)            (FALSE ALARM / FALSE POSITIVE)
  Effect Not Observed     Condition is falsely             Condition is correctly
                          considered to be absent          considered to be absent
                          (MISS / FALSE NEGATIVE)          (CORRECT REJECTION / TRUE NEGATIVE)

True or false – Assuming Errors
  1. Observing the effect when the condition exists is a Hit
  2. Not observing the effect when the condition exists is a Miss
  3. Observing the effect when the condition is absent is a False Alarm
  4. Not observing the effect when the condition is absent is a Correct Rejection

Synonyms for these terms are:

  1. Hit – True Positive (TP), Sensitivity
  2. False Alarm – False Positive (FP), Type I Error
  3. Miss – False Negative (FN), Type II Error
  4. Correct Rejection – True Negative (TN), Specificity

I will use the abbreviations TP, FP, FN, and TN in most of the discussion, although their meanings are probably not as easily grasped as the signal detection terms.

Proportions

The above matrix may be used to show more than one thing. It can show the proportions, the expected percentages or odds, for each cell of the matrix under some testing scenario. It can also show the expected counts for each cell, given an overall count for the number of tests.

In assigning proportions to these categories, the ratios can be expressed as fractions, decimal fractions, or percentages. We have the following proportions of interest:

Overall Estimates of True Percentages:

  1. percent of people who are actually infected
  2. percent of people who are truly not infected

We will make this a binary split, not allowing for degrees of infection. Degrees of infection do matter, but not for this discussion.

Since you don’t know the true percentage of infections, you must estimate it. How this should be done is problematic in many cases: there may be little data, and the data may be suspect.

How we arrive at these estimated percentages is complex: scientific, statistical, and not without error. It should be done independently of the test being evaluated. We call these percentages the prior probabilities, the priors, or the baseline values.

Test Performance

We also need to look at the performance of a given test for classifying the results.

  1. Of those people who are actually infected, what percentage tests as true positive (TP)?
  2. We can get the false negative (FN) percentage, the misses, by subtracting the true positive percentage from 100 percent. This is the arithmetic complement. Conversely, if you know the percentage of false negatives, you can take its complement to get the percentage of true positives.
  3. Of those people who are truly not infected, what percentage tests as true negative (TN)?
  4. We can get the false positive (FP) percentage, the false alarms, by subtracting the true negative percentage from 100 percent. Again, this is the arithmetic complement. Conversely, if you know the percentage of false positives, you can take its complement to get the percentage of true negatives. A minimal sketch of this complement arithmetic follows the list.
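Here is a minimal sketch of the complement arithmetic in Python; the rates are illustrative assumptions, not data from any particular test.

    # Complement arithmetic for test performance rates. Sensitivity and
    # miss rate apply to the infected; specificity and false alarm rate
    # apply to the uninfected.
    sensitivity = 0.95                    # true positive rate, assumed
    miss_rate = 1.0 - sensitivity         # false negative rate = 0.05

    specificity = 0.90                    # true negative rate, assumed
    false_alarm_rate = 1.0 - specificity  # false positive rate = 0.10

    # Each pair always sums to 1 within its own column of the matrix.
    assert abs(sensitivity + miss_rate - 1.0) < 1e-9
    assert abs(specificity + false_alarm_rate - 1.0) < 1e-9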

At first glance, you might think that you can apply these percentages against the whole matrix, assuming that the matrix represents 100%, and each of the four cells has some fraction, all adding up to 100%. Things don’t work that way.

The test performance percentages for separating false positives from true negatives apply only to those who are uninfected. Remember, the information on overall infection rates is obtained in some other manner: other studies, some wild-assed guess, or a deity told you.

On the other side of the matrix, the test performance percentages for separating false negatives from true positives apply only to those who are infected. Remember the sources of this information laid out above.

Here is a diagram adapted from a very good tutorial on this topic. See “Confused by The Confusion Matrix: What’s the difference between Hit Rate, True Positive Rate, Sensitivity, Recall and Statistical Power?” by The Curious Learner, https://learncuriously.wordpress.com/2018/10/21/confused-by-the-confusion-matrix/

Probabilities Based on Whether or Not the Effect Exists

                          Does the Effect Exist?

                          Effect Exists                    Effect Doesn’t Exist
Was the Effect Observed?
  Effect Observed         · Hit Rate                       · False Alarm Rate
                          · True Positive Rate             · False Positive Rate
                          · Sensitivity                    · Statistical Significance
                          · Statistical Power              · Type I Error Rate (Alpha)
                          · (1 – Beta)
  Effect Not Observed     · Miss Rate                      · Correct Rejection Rate
                          · False Negative Rate            · True Negative Rate
                          · Type II Error Rate (Beta)

Testing as Evidence

Tests provide evidence. Evidence must be:

  1. found or produced somehow.
  2. assessed for reliability, quality, internal validity.
  3. interpreted, examined for external validity, the implications made clear.

Tests can be given scores based on sensitivity (true positives) and specificity, also called selectivity (true negatives). As shown above, true positives and false negatives are complements of one another; likewise, true negatives and false positives are complements of one another.

Incorporating the Priors

Testing must take the priors into account when the calculations are done. It makes no sense to apply the percentages for false positives and true negatives to the category of estimated infected. Likewise, it makes no sense to apply the percentages for true positives and false negatives to the category of estimated uninfected. The false positive and true negative test percentages are based on the uninfected; the true positive and false negative test percentages are based on the infected.

Incorporating the Case Counts

We can work with percentages, but for analysis we really want to see actual counts. We use the overall number of independent tests, the priors, and the test performance to fill in a two-by-two matrix of estimated counts.

Threshold

You can set a threshold for a test score; the threshold sets the bias point. If you set the threshold to give more hits, you will also get more false alarms and miss less often. If you set the threshold to give fewer hits, you will get fewer false alarms and miss more often. A small sketch of this trade-off follows.
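Here is a minimal sketch of that trade-off in Python, assuming test scores for the uninfected and infected follow two overlapping normal distributions; the distributions and threshold values are illustrative assumptions, not a model of any real test.

    from math import erf, sqrt

    def normal_cdf(x, mu, sigma):
        # Probability that a normal(mu, sigma) score falls below x.
        return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

    # Assumed score distributions: uninfected centred at 0, infected at 2.
    for threshold in (0.5, 1.0, 1.5):
        hit_rate = 1.0 - normal_cdf(threshold, mu=2.0, sigma=1.0)
        false_alarm_rate = 1.0 - normal_cdf(threshold, mu=0.0, sigma=1.0)
        print(f"threshold={threshold:.1f}  "
              f"hit rate={hit_rate:.2f}  false alarm rate={false_alarm_rate:.2f}")

Raising the threshold lowers the hit rate and the false alarm rate together; lowering it raises both, which is exactly the trade-off described above.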

False Versus True and Priors

With a low prior rate of infection, the number of false positives can be much greater than the number of true positives, even with an accurate test, as the following arithmetic shows.
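A quick check of that claim, using the same illustrative numbers as the worked example later in this piece (2% prevalence, 95% sensitivity, 90% specificity):

    # Probability of infection given a positive test, by Bayes' rule.
    prevalence = 0.02    # prior probability of infection, assumed
    sensitivity = 0.95   # P(test positive | infected)
    specificity = 0.90   # P(test negative | not infected)

    p_positive = (sensitivity * prevalence
                  + (1.0 - specificity) * (1.0 - prevalence))
    ppv = sensitivity * prevalence / p_positive
    print(f"P(infected | positive) = {ppv:.3f}")  # about 0.162

At this prevalence roughly five out of six positives are false, even though the test itself is reasonably accurate.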

Test Performance and Receiver Operating Characteristic (ROC)

Estimated or actual values for a given test can be plotted, putting the false positive fraction (X) against the true positive fraction (Y) to give a curve. This plot is called the receiver operating characteristic (ROC) curve. Any point along the curve can be selected as a cut-off point, a threshold. If the threshold is set to detect more cases, you also get more false positives. If it is set to exclude more cases, you also get fewer false positives.

The area under the ROC curve also gives a measure of accuracy: the greater the area, the more accurate the test. Since both axes run from 0 to 1, the maximum possible area is 1. A diagonal ROC curve corresponds to performance at chance levels.

Not all tests are equal. Some have much better accuracy overall. The more bowed the ROC curve is above the diagonal, the better the test.

Example Data for ROC Curve

See http://www.rad.jhmi.edu/jeng/javarad/roc/JROCFITi.html

False Positive Fraction (FPF)  True Positive Fraction (TPF)  TPF 95% CI Lower  TPF 95% CI Upper
0.0000 0.0000 0.0000 0.0000
0.0050 0.2301 0.0169 0.7407
0.0100 0.3135 0.0430 0.7718
0.0200 0.4168 0.0996 0.8061
0.0300 0.4860 0.1545 0.8282
0.0400 0.5384 0.2056 0.8449
0.0500 0.5807 0.2523 0.8587
0.0600 0.6159 0.2949 0.8705
0.0700 0.6461 0.3337 0.8808
0.0800 0.6723 0.3690 0.8901
0.0900 0.6955 0.4012 0.8985
0.1000 0.7161 0.4306 0.9062
0.1100 0.7347 0.4575 0.9132
0.1200 0.7515 0.4821 0.9198
0.1300 0.7668 0.5047 0.9258
0.1400 0.7809 0.5255 0.9314
0.1500 0.7938 0.5447 0.9366
0.2000 0.8454 0.6214 0.9577
0.2500 0.8822 0.6757 0.9723
0.3000 0.9096 0.7160 0.9824
0.4000 0.9466 0.7727 0.9934
0.5000 0.9691 0.8119 0.9978
0.6000 0.9832 0.8424 0.9994
0.7000 0.9918 0.8684 0.9999
0.8000 0.9967 0.8927 1.0000
0.9000 0.9992 0.9189 1.0000
0.9500 0.9998 0.9357 1.0000
1.0000 1.0000 1.0000 1.0000
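Given a table of (FPF, TPF) points like the one above, the area under the curve can be estimated with the trapezoidal rule. A minimal sketch in Python; the points list here is a small subset of the table, for illustration only.

    # Trapezoidal estimate of the area under an ROC curve from
    # (false positive fraction, true positive fraction) points.
    def roc_area(points):
        points = sorted(points)
        area = 0.0
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0
        return area

    pts = [(0.0, 0.0), (0.1, 0.7161), (0.2, 0.8454), (0.5, 0.9691), (1.0, 1.0)]
    print(f"approximate ROC area: {roc_area(pts):.3f}")  # about 0.878

With all of the points from the table included, the estimate comes close to the fitted area reported below.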

Summary ROC Statistics

Number of Cases:   50

Number Correct:    42

Accuracy:          84%

Sensitivity:       88%

Specificity:       80%

Positive Cases Missed:  3

Negative Cases Missed:  5

(A rating of 3 or greater is considered positive.)

Fitted ROC Area:   0.905

Empiric ROC Area:  0.892

Plotting the ROC Curve

The calculator plots the fitted ROC curve with its 95% confidence interval, together with the points of the empirical ROC curve (ROC curve type: fitted); the plot itself is not reproduced here.

Testing Overall

The test can be viewed as the measure plus the measuring method; it can also include the procedures, the protocol, for conducting the test. Differing protocols can change and confound the test results. Tests can be very accurate and still give a large number of false positives when the estimated infection rate is low.

Testing Interpretation

Test results require interpretation by a skilled clinician. Sometimes tests are used for screening, and sometimes for actual diagnosis. One test alone should not be relied upon; tests should be repeated. A sketch of how repetition shifts the probabilities follows.
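As a hedged illustration of why repeating a test helps, here the posterior after one positive result becomes the prior for the next, under the strong (and often unrealistic) assumption that the two results are independent given the person’s true status. The numbers are the same illustrative ones used elsewhere in this piece.

    # Posterior after one positive test becomes the prior for the next.
    def posterior_given_positive(prior, sensitivity, specificity):
        p_positive = (sensitivity * prior
                      + (1.0 - specificity) * (1.0 - prior))
        return sensitivity * prior / p_positive

    prior = 0.02  # assumed baseline infection rate
    after_one = posterior_given_positive(prior, 0.95, 0.90)
    after_two = posterior_given_positive(after_one, 0.95, 0.90)
    print(f"after one positive:  {after_one:.2f}")   # about 0.16
    print(f"after two positives: {after_two:.2f}")   # about 0.65

In practice repeated results are rarely fully independent (the same interfering factor can trip the test twice), so this is best read as an upper bound on the benefit of repetition.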

Testing Signal versus Noise

Test results for the same individual can vary because of “noise” masking the “signal.” By noise we mean fluctuations in the measurement of interest that arise from factors other than the condition of interest, perhaps random factors.

Testing and Time Variance

Test results can vary for the same individual because the underlying conditions can change from one time to the next. Levels of any condition can fluctuate over time: hourly, daily, weekly, … . With health conditions: you get infected, you get sick, you get better, you die, … .

The Calculations of the Estimates

False Positive Calculations

The expected count of false positives requires three numbers:

  1. The prior baseline estimate of the proportion not infected
  2. The test’s false positive (false alarm) rate
  3. The number of tests

Multiply them together to get the expected count of false positives. False positives are only evaluated against the uninfected cases, not all test cases.

False Negative Calculations

The expected count of false negatives requires:

  1. The prior baseline estimate of the proportion infected
  2. The test’s false negative (miss) rate
  3. The number of tests

Multiply them together to get the expected count of false negatives. False negatives are only evaluated against the infected cases, not all test cases.

True Positive Calculations

The expected count of true positives requires:

  1. The prior baseline estimate of the proportion infected
  2. The test’s true positive (hit) rate
  3. The number of tests

Multiply them together to get the expected count of true positives. True positives are only evaluated against the infected cases, not all test cases.

True Negative Calculations

The expected count of true negatives requires:

  1. The prior baseline estimate of the proportion not infected
  2. The test’s true negative (correct rejection) rate
  3. The number of tests

Multiply them together to get the expected count of true negatives. True negatives are only evaluated against the uninfected cases, not all test cases. A sketch pulling all four calculations together follows.
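Here is a minimal sketch, in Python, of all four calculations together; the function and variable names are mine, and the numbers are the illustrative ones used in the example below.

    # Expected confusion-matrix counts from the priors and test performance.
    def expected_counts(n_tests, prevalence, sensitivity, specificity):
        infected = prevalence * n_tests            # baseline true positives
        uninfected = (1.0 - prevalence) * n_tests  # baseline true negatives
        return {
            "TP": infected * sensitivity,            # hits
            "FN": infected * (1.0 - sensitivity),    # misses
            "FP": uninfected * (1.0 - specificity),  # false alarms
            "TN": uninfected * specificity,          # correct rejections
        }

    counts = expected_counts(n_tests=1000, prevalence=0.02,
                             sensitivity=0.95, specificity=0.90)
    print({k: round(v) for k, v in counts.items()})
    # {'TP': 19, 'FN': 1, 'FP': 98, 'TN': 882}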

An Example

In the example below, I set the following parameters:

Number of Tests: 1,000

Population Baseline Estimates
  Prior Baseline Infection Rate Estimate:                             2%
  Baseline True Positives = Prior Baseline × Number of Tests        =  20
  Baseline True Negatives = (1 – Prior Baseline) × Number of Tests  = 980

Testing Method Performance
  Hit Rate (Sensitivity):                                            95%
  Miss Rate = One’s Complement of Hit Rate:                           5%
  False Alarm Rate:                                                  10%
  Correct Rejection Rate (Specificity)
    = One’s Complement of False Alarm Rate:                          90%

Using these parameters, I calculate expected counts:

Expected Counts
  True Positives (TP)  = Baseline True Positives × Hit Rate                =  19
  False Positives (FP) = Baseline True Negatives × False Alarm Rate        =  98
  False Negatives (FN) = Baseline True Positives × Miss Rate               =   1
  True Negatives (TN)  = Baseline True Negatives × Correct Rejection Rate  = 882

Summaries

I summarize the calculated values in the matrix below. You can see that the number of false positives, under these assumptions, is about five times the number of true positives, i.e., very high. Also, the number of false negatives is very low for this test and this prior infection rate. This is with a test specificity (selectivity) of 90%, a test sensitivity of 95%, and an estimated infection rate of 2%.

                          Does The Condition Exist?

                          Condition Exists    Condition Is Absent
Was the Effect Observed?
  Effect Observed         TP = 19             FP = 98
  Effect Not Observed     FN = 1              TN = 882

Estimated Counts, Based on Test Performance, Priors, and Number of Tests

Measures of Test Performance

Below are various measures of test performance, of test quality, using the data from the example above. The calculations presented here are simple; the interpretation takes more skill.

Core Set of Measures

Measures of Test Performance
  Diagnostic Accuracy = (TP + TN) / (TP + TN + FP + FN)  =  0.90
  Sensitivity         = TP / (TP + FN)                   =  0.95
  Specificity         = TN / (TN + FP)                   =  0.90

Predictive Values

  Positive Predictive Value (PPV) = TP / (TP + FP)  =  0.16
  Negative Predictive Value (NPV) = TN / (TN + FN)  =  1.00
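These values are simple to compute from the counts. A minimal sketch, reusing the counts from the example above (the function name is mine):

    # Performance measures computed from confusion-matrix counts.
    def measures(tp, fp, fn, tn):
        return {
            "diagnostic accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "PPV": tp / (tp + fp),
            "NPV": tn / (tn + fn),
        }

    for name, value in measures(tp=19, fp=98, fn=1, tn=882).items():
        print(f"{name}: {value:.2f}")
    # diagnostic accuracy 0.90, sensitivity 0.95, specificity 0.90,
    # PPV 0.16, NPV 1.00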

The Positive Predictive Value (PPV) and the Negative Predictive Value (NPV) give probabilities based on whether or not the effect was observed. This contrasts with sensitivity and specificity, which give probabilities based on whether or not the effect actually exists.

Probabilities Based on Whether or Not the Effect was Observed

                          Does the Effect Exist?

                          Effect Exists                    Effect Doesn’t Exist
Was the Effect Observed?
  Effect Observed         · True Discovery Rate            · False Discovery Rate
                          · Positive Predictive Value
                          · Precision
  Effect Not Observed     · False Omission Rate            · True Omission Rate
                                                           · Negative Predictive Value

Adapted from “Confused by The Confusion Matrix: What’s the difference between Hit Rate, True Positive Rate, Sensitivity, Recall and Statistical Power?” by The Curious Learner, https://learncuriously.wordpress.com/2018/10/21/confused-by-the-confusion-matrix/

Test Estimates with Differing Priors

Here are test estimates based upon five differing population baseline estimates, that is, differing estimates of priors. I vary the priors from 0.1 percent to 99.9 percent.

N.B. In order to avoid division by 0, I did not use 0.0 percent and 100 percent.

Number of Tests                                      1,000    1,000    1,000    1,000    1,000

Testing Method
  Hit Rate (Sensitivity)                               95%      95%      95%      95%      95%
  Miss Rate = One’s Complement of Hit Rate              5%       5%       5%       5%       5%
  False Alarm Rate                                      10%      10%      10%      10%      10%
  Correct Rejection Rate (Specificity)
    = One’s Complement of False Alarm Rate              90%      90%      90%      90%      90%

Population Baseline Estimates
  Prior Baseline                                      0.10%       2%      50%      98%   99.90%
  Baseline True Positives
    = Prior Baseline × Number of Tests                    1       20      500      980      999
  Baseline True Negatives
    = (1 – Prior Baseline) × Number of Tests            999      980      500       20        1

Expected Counts
  True Positives (TP)
    = Baseline True Positives × Hit Rate               0.95       19      475      931   949.05
  False Positives (FP)
    = Baseline True Negatives × False Alarm Rate       99.9       98       50        2      0.1
  False Negatives (FN)
    = Baseline True Positives × Miss Rate              0.05        1       25       49    49.95
  True Negatives (TN)
    = Baseline True Negatives × Correct Rejection Rate 899.1     882      450       18     0.90

Quality Measures
  Diagnostic Accuracy
    = (TP + TN) / (TP + TN + FP + FN)                  0.90     0.90     0.93     0.95     0.95
  Sensitivity = TP / (TP + FN)                         0.95     0.95     0.95     0.95     0.95
  Specificity = TN / (TN + FP)                         0.90     0.90     0.90     0.90     0.90
  Positive Predictive Value (PPV) = TP / (TP + FP)     0.01     0.16     0.90     1.00     1.00
  Negative Predictive Value (NPV) = TN / (TN + FN)     1.00     1.00     0.95     0.27     0.02
  Positive Likelihood Ratio (LR+)
    = Sensitivity / (1 – Specificity)                  9.50     9.50     9.50     9.50     9.50
  Negative Likelihood Ratio (LR–)
    = (1 – Sensitivity) / Specificity                  0.06     0.06     0.06     0.06     0.06
  Youden’s Index = (Sensitivity + Specificity) – 1     0.85     0.85     0.85     0.85     0.85
  Diagnostic Odds Ratio (DOR) = (TP / FN) / (FP / TN) 171.00   171.00   171.00   171.00   171.00
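The pattern in the table can be reproduced in a few lines. A minimal sketch, holding sensitivity and specificity fixed while the prior varies; proportions are used instead of counts, since the number of tests cancels out of PPV and NPV.

    # PPV and NPV as the prior varies; sensitivity 0.95, specificity 0.90.
    sens, spec = 0.95, 0.90
    for prior in (0.001, 0.02, 0.50, 0.98, 0.999):
        tp = prior * sens                    # true positive fraction
        fn = prior * (1.0 - sens)            # false negative fraction
        fp = (1.0 - prior) * (1.0 - spec)    # false positive fraction
        tn = (1.0 - prior) * spec            # true negative fraction
        ppv = tp / (tp + fp)
        npv = tn / (tn + fn)
        print(f"prior={prior:6.3f}  PPV={ppv:.2f}  NPV={npv:.2f}")
    # PPV runs from 0.01 up to 1.00 and NPV from 1.00 down to 0.02,
    # matching the table, while sensitivity, specificity, the likelihood
    # ratios, and the diagnostic odds ratio stay constant.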

Bibliography

  1. Baratloo, Alireza, Mostafa Hosseini, Ahmed Negida, and Gehad El Ashal. “Part 1: Simple Definition and Calculation of Accuracy, Sensitivity and Specificity.” Emergency 3, no. 2 (2015): 48–49.
  2. Harvey, Lew. “Detection Theory.” Psychology of Perception course notes, 2014.
  3. The Curious Learner. “Confused by The Confusion Matrix Part 2: ‘Accuracy’ Is But One of Many Measures of Accuracy….” The Curious Learner (blog), October 27, 2018. https://learncuriously.wordpress.com/2018/10/28/confused-by-the-confusion-matrix-part-2/.
  4. ———. “Confused by The Confusion Matrix: What’s the Difference between Hit Rate, True Positive Rate, Sensitivity, Recall and Statistical Power?” The Curious Learner (blog), October 20, 2018. https://learncuriously.wordpress.com/2018/10/21/confused-by-the-confusion-matrix/.
  5. Lockdown Sceptics. “Lies, Damned Lies and Health Statistics – the Deadly Danger of False Positives.” Accessed October 6, 2020. https://lockdownsceptics.org/lies-damned-lies-and-health-statistics-the-deadly-danger-of-false-positives/.
  6. National Research Council. Intelligence Analysis: Behavioral and Social Scientific Foundations. Washington, DC: The National Academies Press, 2011. https://doi.org/10.17226/13062.
  7. “ROC Analysis: Web-Based Calculator for ROC Curves.” Accessed October 11, 2020. http://www.rad.jhmi.edu/jeng/javarad/roc/JROCFITi.html.
  8. “ROC Curves – What Are They and How Are They Used?” Accessed October 4, 2020. https://acutecaretesting.org/en/articles/roc-curves-what-are-they-and-how-are-they-used.
  9. Šimundić, Ana-Maria. “Measures of Diagnostic Accuracy: Basic Definitions.” EJIFCC 19, no. 4 (January 20, 2009): 203–11.
