The Original Sin of Large Scale Educational Assessment

The Standards for Educational and Psychological Testing describe five “sources of validity evidence” on pages 13-21.

  • Evidence Based on Test Content

  • Evidence Based on Response Processes

  • Evidence Based on Internal Structure

  • Evidence Based on Relations to Other Variables

  • Evidence for Validity and Consequences of Testing 

Only one of these is really about even moderately sophisticated psychometrics: Evidence Based on Internal Structure. The others are either content-based or rely on other sorts of statistical techniques. But evidence based on internal structure gets at some real issues in psychometrics. It is also easy to understand, as its explanation is the shortest of the five potential sources of validity evidence. The first of its three paragraphs says:

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity (p. 16).

And yet, the practice of developing, administering and reporting large scale standardized educational assessment seems to have mostly abandoned this form of validity evidence—the only form that really gets at psychometric issues. 

Straightforward examination of domain models (e.g., state learning standards) immediately reveals that these tests are supposed to measure multi-dimensional constructs. Those who know the constructs and content areas best are quite clear that these constructs (i.e., content areas) are multidimensional, with different students doing better in some areas and worse in others. They require an array of different sorts of lessons and ought to be measured with an array of different sorts of questions. 

I was taught that this kind of psychometric analysis is really about factor analysis of some sort: working out which items tend to lean into which factors (dimensions), and then using qualitative content-based analysis to confirm that this is as it should be. Heck, the basic question of whether the hypothetical dimensionality of the construct is reflected in the empirical dimensionality of the instrument…well, I was taught that that is really important. And The Standards seem to say that, too.
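To make the dimensionality question concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything taken from The Standards or from an operational testing program: the simulated data, the two-dimension blueprint, the loadings, and the use of continuous item scores. It counts eigenvalues of the inter-item correlation matrix that exceed 1 (the Kaiser rule, a deliberately crude heuristic) and then checks whether the items split along the intended dimensions. A real analysis would more likely use exploratory or confirmatory factor analysis on correlations suited to scored item responses.

```python
import numpy as np


def simulate_two_factor_scores(n_examinees=2000, items_per_factor=6,
                               loading=0.7, factor_corr=0.3, seed=0):
    """Simulate continuous item scores from a hypothetical two-dimensional construct."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, factor_corr], [factor_corr, 1.0]])
    factors = rng.multivariate_normal(np.zeros(2), cov, size=n_examinees)
    # Intended blueprint: first block of items -> dimension 0, second block -> dimension 1.
    membership = np.repeat([0, 1], items_per_factor)
    noise = rng.normal(size=(n_examinees, membership.size))
    scores = loading * factors[:, membership] + np.sqrt(1 - loading**2) * noise
    return scores, membership


def empirical_dimensionality(scores):
    """Eigen-decompose the inter-item correlation matrix, largest eigenvalue first."""
    corr = np.corrcoef(scores, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    n_factors = int(np.sum(eigenvalues > 1.0))  # Kaiser rule: a rough heuristic only
    return n_factors, eigenvalues, eigenvectors


scores, membership = simulate_two_factor_scores()
n_factors, eigenvalues, eigenvectors = empirical_dimensionality(scores)

# With two correlated dimensions, the second component contrasts the two item blocks,
# so the sign of each item's weight on it recovers the intended grouping.
split = eigenvectors[:, 1] > 0

print("Empirical dimensions (Kaiser rule):", n_factors)
print("Item grouping from second component:", split.astype(int))
print("Intended grouping from blueprint:   ", membership)
```

The last two printed lines are where the qualitative, content-based check would come in: the empirical grouping (which may be labeled in the opposite order, since an eigenvector's sign is arbitrary) is compared against what the domain model says each item is supposed to measure.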

But instead of ensuring that the dimensionality of the instrument matches the dimensionality of the domain model, the dominant mode in large scale educational assessment has an almost knee-jerk reliance on unidimensional models. Heck, items that fail to conform to this demand are discarded, as model fit statistics are the ultimate determinant of whether they can be included on a test (form). Such statistics are used to ensure that the dimensionality of the instruments does not match that of the construct. 
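The screening mechanism described above can be sketched with a small simulation. The assumptions are loud ones: the data are simulated, the blueprint (ten items on one dimension, three on another) is hypothetical, item-rest correlation stands in for whatever fit statistic a real program would use, and the cutoff is arbitrary. The only point is to show how screening items against a single dominant dimension tends to flag the items that measure the smaller one.

```python
import numpy as np

rng = np.random.default_rng(1)
n_examinees = 2000

# Hypothetical blueprint: 10 items on dimension A (0), 3 items on dimension B (1).
membership = np.array([0] * 10 + [1] * 3)
loading, factor_corr = 0.7, 0.3

# Simulate two correlated latent dimensions and continuous item scores.
cov = np.array([[1.0, factor_corr], [factor_corr, 1.0]])
factors = rng.multivariate_normal(np.zeros(2), cov, size=n_examinees)
noise = rng.normal(size=(n_examinees, membership.size))
scores = loading * factors[:, membership] + np.sqrt(1 - loading**2) * noise


def item_rest_correlations(scores):
    """Correlate each item with the sum of all the other items."""
    total = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]
        for i in range(scores.shape[1])
    ])


item_rest = item_rest_correlations(scores)

cutoff = 0.45  # arbitrary illustrative threshold, not an industry standard
flagged = item_rest < cutoff

for i, (dim, r, flag) in enumerate(zip(membership, item_rest, flagged)):
    label = "FLAGGED" if flag else "kept"
    print(f"item {i:2d}  dimension {'AB'[dim]}  item-rest r = {r:.2f}  {label}")
```

In this setup the flagged items are the ones written to dimension B. Nothing is wrong with them as items. They simply measure the part of the construct that the total score mostly does not.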

This use of such statistics combines with the use of unidimensional models to ensure that tests are not valid, by design. It ensures that domain models will be reread, reinterpreted and selected from only insofar as they can support the psychometric model. The tail wags the dog.

There are many issues with large scale assessment that cause educators, learners, parents and the public to view it as “the enemy,” as Steve Sireci observed in his 2020 NCME Presidential Address. But if I had to pick the single most important one, this would be it. Multiple choice items are problematic, but it is quite often possible to write good multiple choice items that i) reflect the content of the domain model, ii) prompt appropriate response processes, iii) combine for an internal structure that resembles that of the domain model, iv) combine to have appropriate relationships to other variables, and v) support appropriate inferences and consequences. But none of those are possible while insisting that items and tests are not allowed to match the structure of the domain model. This is not simply about ignoring the domain model, as some sort of neglect. Rather, this is active hostility that affirmatively bars using it as the primary reference for test development.

Looking for DIF (differential item functioning) or other failures of invariance that suggest fairness issues is not enough, so long as the structure of the domain model itself is barred from properly influencing test construction, as The Standards say it should.

To state this more plainly, this practice sets psychometric considerations as the main obstacle to developing valid tests—or tests that can be put to any valid use or purpose.