The absolute most important question about any test result is what it actually means. The first sentence of the first chapter of The Standards for Educational and Psychological Testing points to "the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests," and calls this validity.
To understand what a test result means, we have to understand what it measures in the aggregate, which means we have to understand what it measures at the item level. There is no magic that can make a whole measure something that its individual items do not. There is no way to figure out what the overlap of a bunch of disparate items means: the non-overlap creates huge errors, and if you do not know what the individual items measure, you cannot figure out what their overlap measures.
This is the question of item alignment. What do the items, the building blocks of any assessment, actually measure? Do they measure what they are supposed to measure? How do we figure that out? What are the common pitfalls and mistakes that can undermine such investigations?
The last couple of years have seen a huge increase in interest in AI-generated items, sometimes with a human-in-the-loop and sometimes not. We have read papers and seen presentations, but the evaluation of what these items actually measure has been…disappointing. We have seen mistakes that novice content development professionals learn to avoid repeated as though they were standard practice. For example, many AI researchers in educational measurement evaluate only the stem of a multiple choice question, without considering the answer options or the cognitive paths that might lead to an incorrect answer. Again and again, researchers who do not understand how potential test takers learn particular material, or the mistakes they actually make, offer their less-than-expert opinions on the KSAs that an item requires.
When challenged on this, they told me that they couldn't find anything in the literature on item alignment. So, I spent a very frustrating few weeks going through the educational measurement literature and texts to see what they had to offer on this question. And they were right. There is quite a bit on blueprint-, test-, or form-alignment. There are some dimensions that might be considered (e.g., Webb) when rolling up item alignment decisions into test alignment determinations, but nothing on how to make those item-level judgments in the first place. There simply is not a literature on item alignment.
But AI-generated items are useless if they do not actually measure what they are supposed to measure. Bad building blocks cannot fulfill the requirements of test blueprints and can produce indecipherable test results. Worse, they can produce fraudulent test results that simply do not report on what they claim to report, and that suggest inferences for which there simply is not sufficient evidence or theory.
So, here is a review of item alignment. Here are the basic considerations for determining whether an item is aligned to its alignment reference, whether that is a standard, an assessment target, or something else. If we are going to evaluate the potential of AI-generated items, we really need to be rigorous in our evaluation of the products they provide: the items!
Item Alignment: Understanding the Quality of the Evidence that Items Elicit
Alignment—the mapping between test items and their intended constructs—is central to test validation but remains understudied at the micro-level of individual items. This paper examines how judgments about item alignment are made in practice, analyzing five common misconceptions: ignoring item modality, ignoring alternative cognitive paths, ignoring additional KSAs, lacking deep expertise with the domain model, and failing to consider the diverse range of test takers. We frame these issues using Type I (false positive) and Type II (false negative) errors in inferences about test-taker proficiency at the micro-level of individual alignment references (e.g., standards). We further explore the nature and impact of different sources of additional KSAs. The paper then examines challenges in alignment within a standard, including difficulty, learning pathways, components of complex standards, and text complexity. Despite the importance of targeting the core rather than the margins of standards, numerous factors incentivize alignment with the less important margins of a standard, including ease of item development, psychometric pressures, and naïve misreadings of standards by non-experts. We argue that improved alignment requires recognizing the distinct requirements of large-scale standardized assessment and bridging disciplinary training gaps between psychometric perspectives and content development expertise to improve the quality of evidence elicited by test items.