Has IRT Destroyed Old School Validity?

When I first learned about the measurement idea of validity, I was taught that it is about whether the measure actually measures the construct. I was taught that validity pairs with reliability, which is about how consistent the measure is. Reliability is like the margin of error from one measurement to the next, but validity is whether you’re aiming at the right target. I have had this definition in my head for…decades? I think I first learned about this in a psychology class in the 1980s.

When I came to the field of educational measurement this century, I found a somewhat different definition. Since the 1999 Standards for Educational and Psychological Testing (AERA, APA, NCME), in educational measurement validity is about whether there is evidence to support the use of a test or measurement for a particular purpose. We all stress that validity is no longer a property of the test itself, but rather a property of the particular test use. And there are different kinds of evidence that can contribute to this idea of test validity.

That shift in attention to particular test uses is really important. When tests are repurposed, they might no longer be appropriate. For example, a very good test of 4th grade mathematics is simply not a good test of 6th grade mathematics. It is not that the test has changed, but rather that its use has changed. So, the old validity evidence for the old use is no longer germane.

I buy that. I really do. But I still have in my head the issue of the basic inference. That is, totally apart from test use, does the test actually measure what it claims to measure? Are the inferences we make about test takers…valid? I think that that still matters.

In fact, I think that whether the tests measure what they are supposed to measure is the real point. I think that that old school idea of validity as simply the question of whether the test measures what it is supposed to measure is critically important. If it does, then appropriate uses are kinda obvious. And inappropriate uses are also kinda obvious.

So why the shift from the 1985 Standards to the 1999 Standards?

I have a theory that is probably incorrect. But it’s in my head.

For decades, the statistics behind big standardized tests have been based on something called IRT (item response theory), and before that they were based on CTT (classical test theory). Each of these generally reports a single score that is useful for sorting and ranking test takers. No matter how many different elements the test is supposed to measure—like the different standards in a domain model—they each report a single unified score. However, for them to work reliably, test developers remove potential test items that seem like they might be measuring something a little different from the other items. So, the better each item does at measuring its targeted standard, the less likely that item is to be included. The more an item instead kinda measures some muddled middle idea of the construct, the more likely it is to be selected. Psychometricians call that “model fit,” and the model is usually unidimensional IRT or CTT.

When there is a conflict between a multi-dimensional domain model (e.g., the different knowledge, skills and abilities that go into a set of standards) and a unidimensional psychometric model, modern educational measurement favors the unidimensional model—throwing aside items that better fit the domain model than the psychometric model.
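To make that filtering mechanism concrete, here is a minimal simulation sketch. It is not from any operational testing program, and everything in it is my own assumption for illustration: ten hypothetical items, two correlated latent dimensions, and corrected item-total correlation (a simple CTT discrimination index) standing in for a model fit screen. Eight items load on a dominant dimension and two cleanly target a second dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 5000

# Two correlated latent dimensions (the labels "procedural fluency" and
# "problem solving" are hypothetical).
cov = [[1.0, 0.5], [0.5, 1.0]]
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_students)

# Ten items: items 0-7 load on dimension 1; items 8-9 cleanly target dimension 2.
loadings = np.array([[1.0, 0.0]] * 8 + [[0.0, 1.0]] * 2)
difficulty = rng.normal(0, 0.5, size=10)

# Simple 2PL-style response model: P(correct) = logistic(loading . theta - difficulty)
logits = theta @ loadings.T - difficulty
p_correct = 1.0 / (1.0 + np.exp(-logits))
responses = (rng.random((n_students, 10)) < p_correct).astype(int)

# CTT-style screening: corrected item-total correlation
# (each item against the sum of the OTHER items).
total = responses.sum(axis=1)
for j in range(10):
    rest = total - responses[:, j]
    r = np.corrcoef(responses[:, j], rest)[0, 1]
    print(f"item {j:2d}  corrected item-total r = {r:.2f}")
```

In this toy setup, the two items that best measure the second dimension come out with the lowest discrimination statistics, so a rule like "drop the items with the weakest item-total correlations" removes exactly the items that best fit the multi-dimensional domain model. A unidimensional IRT fit statistic would tell a similar story; I use the CTT index here only because it is simple to compute.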

As a content person, I have never been able to figure out what that mushy middle means. On a 4th grade math test, it’s some vague idea of math ability…but it’s not clear which aspects of math ability factor in and which do not. It might include ability to learn math. But how much? It might include computational accuracy. But how much? It might include problem solving ability. But how much? Or even reading ability! Because model fit statistics lead to the removal of really strongly targeted items (as opposed to items that lean towards the muddled middle), I don’t think we could ever know.

These techniques produce a seemingly definitive ranking of test takers with seemingly definitive quantitative scores—often to many decimal places. But it is never clear what they are ranked on. Something about math…but what? When IRT is combined with item selection and model fit statistics, the result is most definitely not a thoughtfully weighted composite score.

Which takes me back to the question of old school validity vs. new school educational measurement test validity. Was the change necessary simply because we never know what IRT is scoring students on, from a content perspective? That is, IRT results are not interpretable through the lens of the construct, so we no longer focus on the inference?

That’s what I am thinking about, these days.