Haladyna et al.’s item writing rules project might have begun as merely a literature review, but by the time any of it was first published (Haladyna & Downing, 1989), it was presented “as a complete and authoritative set of guidelines for writing multiple-choice items.” The 2002 update (Haladyna, Downing & Rodriguez)—the last that claims to be grounded in the work of others—says “for classroom assessment” in the title, but then states in the abstract, “This taxonomy may also have usefulness for developing test items for large-scale assessments.” Their project has always been quite ambitious, intended to be far more than merely a literature review.
I have no problem with this ambition. The first fundamental problem with the Haladyna Rules is the arrogance. They do not explain the reasoning behind each rule, and sometimes they even go against the literature. Ambition is offering such expansive guidance; arrogance is doing so when you have never actually done the work professionally yourself. Ambition is trying to cover both classroom assessment and large-scale standardized assessment; arrogance is mistaking your own opinions and/or preferences for reasoning and/or evidence.
The second fundamental problem with the Haladyna Rules project is that it displays profound ignorance about how items work and what their purpose is. Yes, the two articles (1989, 2002) are largely oriented toward reporting on what others have written, but their two major books (2004, 2013) are not. In either case, they could have explained their rules in terms of item functionality and purpose, but they do not. Are these rules focused on items’ ability to provide the intended information? Or, as we say so very often, on whether items elicit evidence of the targeted cognition for the range of typical test takers? No. Their rules are focused on clarity and even presentation. They feel incredibly shallow, as though little consideration is given to content, and only a hand wave to test takers.
The third problem with the Haladyna Rules is their organization. Content, Formatting, Style, Stem, and Choices is simply not a good structure, and they do not even follow it consistently. Some ideas are listed separately in the Stem and Choices sections, while others are listed elsewhere to apply to both parts of an item. The cluing rule uniquely has subparts, but oddly does not include all of the cluing rules/guidelines. It looks as though they imposed their own latent thinking, but not mindfully enough to do it well. It’s a mess. Perhaps they should have a section on Cluing and a section on Clarity, as these themes are common in their list.
The fourth problem with the Haladyna Rules is that the project is just sloppy. Despite all the iterations, there are a number of pairs of redundant rules: 13 & 16, 11 & 12, 17 & 27, 29 & 30, 14 & 15, and 23 & 24. Each of these pairs clearly should be consolidated into a single rule, and it is inexplicable to me that they were not. Moreover, the articles offer explanations for only some of their rules, not for all of them. Yes, by 2013 it is less sloppy, but most of the redundancies remain.
From my perspective—from the perspective of me and my frequent co-author—the deepest problem is the lack of understanding of what makes for a high-quality item. Items must align with their assessment targets; they should measure what they are supposed to measure, and not something else in the neighborhood. That idea of alignment seems absent from these rules. And for items to be aligned, they must be aligned for the range of test takers, not just the ones like oneself or one’s own students. This is the idea of formal/technical fairness. But their project contains only a small hint of that concern, and it is not explicit in any of the actual rules. This is what we refer to as content and cognition.
So, this approach to “Item-Writing Guidelines/Rules/Suggestions/Advice,” as they termed it, was destined to fail. Rather than supporting teachers, item writers, and/or content development professionals with the hardest parts of item development, they focus on the inane (e.g., “vertically instead of horizontally”), the obvious (e.g., correct spelling), the tautological (e.g., “avoid over specific and over general content”), the false (e.g., “the length of choices about equal”), and the shallow (i.e., just about everything else). The rules do not acknowledge the importance of item stimuli (e.g., reading passages), or even their existence! Even when there is a decent kernel of an idea, they fall short—usually woefully short.
Could a good list of item writing principles have come out of this project? I don’t know. I really don’t. I do not understand turning to people who lack deep experience in item development and a deep understanding of how items function, but that appears to be the basis for their project. Of course, I understand that that is how scholarship and literature reviews work, but they literally claim that this resulted in a “complete and authoritative” list. Furthermore, their later works do not present this material as merely a literature review; these are their guidelines for writing items.
My colleagues and I have seen how these rules have become embedded in the field: in item writing manuals, in style guides, in educational assessment works of scholarship and of instruction. And they are bad. You can review my daily posts from last month to see me take each of them down—all but Rule 14—but that’s not entirely necessary. The point is that these rules might help to develop shiny, polished items that look good, but they do not really help to develop highly refined items that elicit evidence of the targeted cognition for the range of typical test takers. And it appears that their list’s mere existence might have hampered the development of better item writing guidance—what else could be responsible?
Now, the field of educational measurement’s focus on item difficulty and item discrimination statistics to evaluate items and to evaluate item writing rules has not helped. This approach undermines domain models and both content- and construct-validity. It renders claims of evidence from test content less than credible, because it bends what are supposed to be criterion-referenced tests into more effective norm-referenced tests. But that is no excuse for this mess. Haladyna et al. are themselves fully on board with this approach to evaluating items, and approvingly refer to such studies of item writing rules in their own work. But none of that explains why the field has not dug more deeply into what makes for an actually useful item.
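For readers who have not run these analyses themselves, here is a minimal sketch, in Python, of the kind of classical item statistics I mean: difficulty as the proportion of test takers answering an item correctly, and discrimination as the point-biserial correlation between the item score and the score on the rest of the test. The data and the function name are hypothetical, purely for illustration; note that nothing in these calculations touches content or the targeted cognition, which is exactly the problem.

```python
# Illustrative sketch only: classical item difficulty and discrimination
# computed from a small, made-up matrix of scored responses.
import numpy as np

def classical_item_stats(responses: np.ndarray) -> list[dict]:
    """responses: 2-D array (test takers x items) of 0/1 item scores."""
    stats = []
    total = responses.sum(axis=1)
    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = total - item  # corrected total: exclude the item itself
        difficulty = item.mean()  # proportion answering correctly (p-value)
        # point-biserial = Pearson correlation of the 0/1 item score with the rest score
        discrimination = np.corrcoef(item, rest)[0, 1]
        stats.append({"item": j, "difficulty": float(difficulty),
                      "discrimination": float(discrimination)})
    return stats

# Hypothetical example: 6 test takers, 4 items
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
])
for row in classical_item_stats(scores):
    print(row)
```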
Instead, we have the Haladyna Rules—which definitely need to be replaced.