The misunderstood relationship between validity and reliability


Perhaps the single biggest foundational misunderstanding in the entire field of educational measurement is the emphasis put on the reliability of tests. While this is not the source of all problems in standardized assessment, few of them are untouched by it.

Reliability and validity are technical terms in statistics and educational measurement. Their technical meanings are related to their everyday meanings, but the two terms overlap with each other far less in technical usage than they do in common speech.

Reliability refers to the consistency of a measurement. A measurement does not have to be correct to be reliable; it just has to be consistent. A bathroom scale that is always off by 7.2 pounds, or always off by 5.85%, is perfectly reliable, even though it is wrong, precisely because it is consistently wrong. Similarly, colorblind people can be technically reliable in their reporting of colors simply because they are consistent. Reliability, therefore, is not a reference to substantive trustworthiness. Reliable does not mean “never wrong”; it includes “consistently wrong.”
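The scale example can be sketched in a few lines of Python. This is a toy illustration, not a real measurement model: the weights and the 7.2-pound bias are invented, and test-retest consistency is summarized here with a plain Pearson correlation.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation, the backbone of most consistency statistics."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
true_weights = [random.uniform(120, 220) for _ in range(100)]
bias = 7.2  # the scale always reads 7.2 pounds high

# Weigh everyone twice on the biased scale.
first = [w + bias for w in true_weights]
second = [w + bias for w in true_weights]

print(round(pearson_r(first, second), 6))    # 1.0: perfectly consistent, hence perfectly reliable
print(round(first[0] - true_weights[0], 6))  # 7.2: and wrong every single time
```

The test-retest correlation is perfect because the error never varies; no consistency statistic can see a constant bias.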

In statistics, validity refers to measuring the correct thing. An instrument that is designed to measure body fat but instead just measures waist size is not very valid, regardless of its reliability. An instrument said to measure socioeconomic status but that only measures educational attainment is not very valid, regardless of its reliability. Unfortunately, using the wrong tool to measure a proxy, or an inadequate proxy to stand in for a construct, is far too common.

In educational measurement, validity is so important that it now has multiple meanings and components. But test validity refers to whether there is evidence (and theory) to support using the test results for a particular purpose. Put more colloquially, “Is there a good basis to use that test for that purpose?” Put a little more practically, “Are the inferences and decisions made based upon this test actually justified by it?” This stems from the statistical definitions, but focuses on the use of a test, rather than the test itself. After all, a test that was valid for an old purpose is not necessarily still valid for a new use.

Unfortunately, validity—be it statistical or educational measurement’s test validity—cannot be quantified. But there are seemingly countless statistical techniques for calculating reliability. Hence, statisticians and psychometricians focus far more on reliability than on validity. Their tools and quantitative mindset push them to what is—for them—a more attractive concept. Numbers! Quantification!
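One of those many reliability formulas, Cronbach's alpha, fits in a handful of lines, which is part of its appeal. The item scores below are invented for illustration; real analyses would use far more examinees and items.

```python
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def cronbach_alpha(items):
    """items: list of k lists, each holding one item's scores across examinees.

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Five examinees, three items (toy data)
items = [
    [3, 4, 3, 5, 2],
    [2, 4, 3, 5, 1],
    [3, 5, 4, 5, 2],
]
print(round(cronbach_alpha(items), 3))  # 0.975
```

A single tidy number comes out, which is exactly the attraction; note that nothing in the calculation asks whether the items measure the right thing.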

The foundational psychometric mistake is that they behave as though—and perhaps believe that—reliability is a sort of fertilizer for validity. That is, in practice the mistaken disciplinary view seems to be that reliability leads to validity. But that has the causal relationship backwards. In fact, validity leads to reliability. But validity is not the only factor that can lead to reliability, and that is where the problems come in. Efforts to increase reliability can be orthogonal to validity, or even come at the expense of validity.

Then, we see Campbell’s Law at work. In theory, psychometrics uses reliability as a proxy for statistical validity, and therefore as the kind of evidence that supports test validity. But when reliability is maximized without attention to statistical validity, reliability can instead be counter-indicative of validity. For example, point-biserial correlations and other item discrimination statistics generally rest on unidimensional assumptions, which undermine the measurement of multidimensional constructs or domain models. The attention paid to reliability actually distorts its meaning and degrades its value as evidence of validity.
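To make the point-biserial problem concrete, here is a toy simulation (the dimensions, item counts, and logistic response model are all invented assumptions): a domain with two independent dimensions, where nine items tap one dimension and a single item taps the other. The point-biserial discrimination, which is just the Pearson correlation between a dichotomous item score and the total score, flags the lone second-dimension item as the weakest, even though it is the only coverage that dimension gets.

```python
import math
import random

random.seed(1)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def simulate_item(ability):
    """Score one dichotomous item: P(correct) rises with the relevant ability."""
    return [1 if random.random() < 1 / (1 + math.exp(-a)) else 0 for a in ability]

n = 2000
dim_a = [random.gauss(0, 1) for _ in range(n)]  # e.g., computation skill
dim_b = [random.gauss(0, 1) for _ in range(n)]  # e.g., interpretation skill, independent of dim_a

# Nine items tap dimension A; a single item taps dimension B.
items = [simulate_item(dim_a) for _ in range(9)] + [simulate_item(dim_b)]
totals = [sum(examinee) for examinee in zip(*items)]

# Point-biserial discrimination: item-total correlation for each item.
discriminations = [pearson_r(scores, totals) for scores in items]
print([round(d, 2) for d in discriminations])
# The dimension-B item lands at the bottom of the list and would be flagged
# for removal, even though it is the only item covering that dimension.
```

Dropping that item would raise every conventional reliability statistic while narrowing the test to a single dimension: reliability up, validity down.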

Maximizing validity (i.e., the most fundamental consideration in developing and evaluating tests) is a worthy goal. Sonny, validity is the greatest thing in the world—except for a nice MLT (mutton, lettuce and tomato sandwich) where the mutton is nice and lean and the tomato is ripe. They're so perky. I love that.

But this kind of focus on reliability? It is bad for measurement, bad for tests, and bad for students.