The 2014 edition of The Standards for Educational and Psychological Testing says that validity is “the most fundamental consideration in developing tests and evaluating tests.” This is the second sentence of the first chapter of that book. The 1999 edition said the same thing, without repeating the word “tests.” The 1985 edition agreed, but back then it was the first sentence.
Validity is the alpha and the omega. It is everything.
So, where does that leave reliability?
The Cliché
Last week, I ran through the cliché explanation of reliability and validity with the metaphor of a target. My rude punchline was that psychometrics — being concerned with metrics (i.e., numbers) — has nothing to offer us about validity.
Because psychometrics has no way to think about validity, it doesn’t have a target at all. Rather, it just looks at how tightly clustered the hits are.
(I know about internal structure and convergent/discriminant evidence. Those are still about reliability. The latter is about reliability with other measures, but it begs the question of whether those other measures are valid. Yes, correlation with various outcomes might offer something, but that is a topic for its own post.)
Generally, psychometrics has no theory, idea, or vision of validity, so it elevates reliability to the most important consideration. But reliability is not the alpha and omega. It is a false god.
The Psychometric Defense
The smartest explanation for the importance of reliability that I have ever heard is that it is the upper bound or upper limit on validity. That is, in the language of the cliché, you cannot hit the bullseye consistently if you cannot be consistent. You cannot hit anything consistently if you cannot be consistent.
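For readers who want the math behind that defense, the standard classical test theory result (a textbook derivation, not anything from this post) is the correction for attenuation, which bounds the observed validity coefficient by the square root of reliability:

$$ r_{XY} \;\le\; \sqrt{r_{XX'}}\,\sqrt{r_{YY'}} \;\le\; \sqrt{r_{XX'}} $$

Here $r_{XY}$ is the observed test–criterion correlation, $r_{XX'}$ is the reliability of the test, and $r_{YY'}$ is the reliability of the criterion. A test with reliability .64, for example, cannot produce a validity coefficient above .80, no matter what it measures.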
My basic response to that is that I do not care how consistent you are if you are not near the target.
So, here’s the real question: which is better?
I acknowledge that they are both pretty damn lousy. But those who prize reliability would prefer Figure D because it is — at least — reliable. I look at Figure D and am quite sure that it does not measure anything that I care about. It’s not noisy; it is just wrong.
Figure C is noisy. There are real problems. It is a lousy and unreliable measure. But at least there is some signal of what I am looking for in that noise. Sure, the confidence intervals are huge, but there is information of value in there.
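To put numbers on the contrast between Figure C and Figure D, here is a minimal simulation sketch. This is my illustration, not the post’s; the target value, bias, and noise levels are invented. The point is that aggregating a noisy-but-valid measure converges on the truth, while a reliable-but-invalid measure delivers a tight confidence interval around the wrong answer.

```python
import numpy as np

rng = np.random.default_rng(0)
true_score = 100.0  # the "bullseye" the test is supposed to measure

# Figure C analogue: valid but unreliable -- centered on the target, very noisy
noisy_valid = rng.normal(loc=true_score, scale=15.0, size=1000)

# Figure D analogue: reliable but invalid -- tightly clustered, off target
reliable_biased = rng.normal(loc=70.0, scale=1.0, size=1000)

for name, scores in [("noisy but valid", noisy_valid),
                     ("reliable but biased", reliable_biased)]:
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(len(scores))
    print(f"{name:20s} mean = {mean:6.2f}, 95% CI half-width = {1.96 * sem:.2f}")

# The noisy measure's mean lands near 100 and its interval shrinks as
# observations accumulate; the biased measure's tight interval just pins
# down the wrong answer (~70).
```

The huge per-observation spread of the noisy measure is exactly the “huge confidence intervals” problem, but there is a recoverable signal underneath it. No amount of aggregation rescues the biased measure.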
Putting it in very concrete terms: I do not need another test of socioeconomic status. No one needs another test of socioeconomic status.
The Problem of Prioritizing Reliability as the Upper Bound on Validity
From what I have seen and read, the idea that reliability is the upper bound on validity has morphed into the idea that we increase validity by increasing reliability, and, therefore, that we can stop worrying about increasing validity because we can just focus on increasing reliability.
There are people who confuse reliability and validity. There are people who say “reliability” when they clearly ought to mean “validity,” but the difference simply does not matter enough to them to realize that they have made a mistake.
When the means becomes the end, what had been valuable becomes the obstacle.
Concerns with Reliability as Obstacle to Validity
There are many causes for the quality problems with our big standardized tests. In my view, the greatest problem is that we are stuck in a vicious cycle in which perceptions by educators and the public of low quality (i.e., lack of validity) limit willingness to spend money for testing and to devote student time to testing. This harms the quality of our tests and…well…repeat.
But this is not the only problem.
The other problem is that those who look to reliability as the most important consideration see moving from Figure F to Figure G as an improvement, and see moving from Figure F to Figure E as a decline. Many are simply unwilling to give up reliability in order to gain validity.
There are item types that constrain reliability, either because they require so many resources that tests must rely on fewer items or because they cannot be scored as reliably. And those item types are incredibly disfavored. Item types that simply cannot get at the real cognition behind the standards are not disfavored. Instead, we get highly reliable items that too often fall short of the actual targeted cognition.
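The resource constraint is easy to quantify. Under the Spearman-Brown prophecy formula (again, a standard classical test theory result rather than anything from this post), cutting a test to a fraction $n$ of its original length changes its reliability from $\rho$ to

$$ \rho_n = \frac{n\rho}{1 + (n - 1)\rho} $$

So shrinking a 40-item test with reliability .90 down to 10 comparably informative items ($n = 0.25$) drops the expected reliability to roughly .69, before any concerns about scoring consistency even enter the picture. The arithmetic alone is enough to make richer, costlier item types look bad on this metric.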
How Does This Keep Happening?
Psychometricians — with their emphasis on reliability — are high status. They have graduate degrees in measurement — perhaps even PhDs. Content development professionals (CDPs) are merely former teachers, with all the low status that that carries.
This status difference often prevents CDPs from even being at the table, and when they are at the table, they are often overruled. When they are not overruled, they are often intimidated into relenting.
And so, psychometric concerns drive assessment development far, far, far more than questions about whether items actually measure what they are purported to measure.
Which clearly, in my view, violates the Standards.