The Most Dangerous Idea in Large-Scale Assessment?

Perhaps the most dangerous idea in large-scale assessment is the idea that items assess standards, as opposed to assessing test takers. That enormous mistake is the cause of so much sloppiness and so many inappropriate inferences from tests, and inappropriate uses of them, which means they are not valid tests!

If items directly assessed standards (or KSAs), then it would be possible to examine an item to see whether it is aligned to the standard without considering test takers. Item developers could just think about the ideal of an item, and the ideal of an item aligned to a particular standard. They would not need to know, understand, or think deeply about students.

But items actually assess students’ proficiencies. They are about test takers’ cognition, which is why they are called cognitive assessments. And test takers vary. They vary a lot. They vary in proficiency, in background, in experiences, and in how they were taught. They vary in their command of other proficiencies. One might be a good reader and another a poor reader, making a word problem a very different challenge for two students with similar arithmetic ability.

We say that a valid item elicits evidence of the targeted cognition for the range of typical test takers. We take the idea that test takers vary very seriously. There is a range of typical test takers for an item: a multi-dimensional range.
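To make that multi-dimensional range concrete, here is a minimal sketch of the word-problem example above, assuming a toy compensatory two-dimensional logistic model (in the spirit of multidimensional IRT). The discrimination values, difficulty, and proficiency levels are all invented for illustration, not drawn from any real item.

```python
import math

def p_correct(theta_math, theta_read, a_math=1.2, a_read=0.8, b=0.5):
    """Probability of success on a word problem under a toy
    compensatory two-dimensional logistic model.
    a_math, a_read: how strongly the item taps each proficiency.
    b: overall item difficulty. All values are illustrative."""
    logit = a_math * theta_math + a_read * theta_read - b
    return 1.0 / (1.0 + math.exp(-logit))

# Two students with identical arithmetic ability (theta_math = 0.5)
# but very different reading proficiency:
strong_reader = p_correct(0.5, theta_read=1.5)   # ~0.79
weak_reader   = p_correct(0.5, theta_read=-1.5)  # ~0.25

print(f"strong reader: {strong_reader:.2f}")
print(f"weak reader:   {weak_reader:.2f}")
```

Under these made-up numbers, two students with identical arithmetic proficiency face success chances of roughly 0.79 and 0.25 on the same item, so the same response carries very different evidence about the targeted cognition for each of them.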

Different test takers might find a different entry point into an item. They might have a different first thought. They might have a different initial guess. They might have a different next cognitive step, after that initial guess. They might consciously apply different strategies, or be differently aware of how they are getting to their answers. Because they have had different teachers who used different explanations or different examples, two test takers can differ in how novel the exact same item appears to them, and in all that that implies about finding a path to a successful response.

Test takers vary.

But little in item development training or item development practice dives into how test takers vary. There is little documentation, and usually none, of the different ways in which a standard is taught, or of the common mistakes and misunderstandings that potential test takers have with the standard.

Instead, we too often rely upon one adult’s view of what the most likely reaction and cognitive path towards a solution might be, too often formed without much thought and without the appropriate humility to acknowledge that there are many other potential reactions and paths.

In spite of all of this, people expect individualized score reports, and people make individual inferences about test takers’ capabilities based on a test that assumes all test takers react and think the same: an assumption that is logically at odds with the range of standards tested and with the idea that different kids will get different items wrong.
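One way to see that tension: two test takers can earn identical total scores while missing entirely different items, yet the individualized report treats them as interchangeable. A minimal sketch, with invented item labels and response patterns:

```python
# Two test takers with the same total score but different items wrong.
# Item labels and response patterns are invented for illustration.
items = ["fractions", "decimals", "word_problem", "geometry", "estimation"]

responses = {
    "student_A": [1, 1, 0, 0, 1],  # missed word_problem and geometry
    "student_B": [0, 0, 1, 1, 1],  # missed fractions and decimals
}

for student, pattern in responses.items():
    score = sum(pattern)
    missed = [item for item, right in zip(items, pattern) if not right]
    print(f"{student}: total = {score}/5, missed: {', '.join(missed)}")
```

Both reports read "3/5," but the two students struggled with entirely different standards, so any individual inference drawn from the total score alone glosses over exactly the variation described above.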

This idea that items assess standards is particularly ill-suited to being paired with the expectation that these tests can deliver useful information about individual test takers. Even without that expectation, the frequent mismatch between real, live test takers and the assumed Aristotelian ideal of a test taker means that even the aggregate results rarely reflect reality well across the tested population.