[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]
Content: Avoid trick items.
This one sounds good. It really does. We’ve even caught ourselves objecting to an item because it seems “too tricky,” but we stop ourselves in that moment because we know that it doesn’t mean anything.
What on earth is a “trick item”? If an item is actually aimed at the targeted cognition, if it elicits evidence of the targeted cognition, and if it follows the other content rules, what does this rule add? It certainly would help if Haladyna and his colleagues defended and explained it.
It turns out that they do try to explain this one in 2002, but it does not go well. First, they claim that this rule is supported unanimously by their sources, which is true only if you ignore the fully one-third of their sources that do not mention it at all. As a qualitative researcher, I must say that it is highly misleading to claim that sources unanimously supported a contention without qualifying that claim by pointing out how many abstentions there were. Perhaps that is because this is actually qualitative research and Haladyna et al. simply do not understand the appropriate use of quasi-statistics when reporting on qualitative research (Maxwell, 2012), something that even full-time qualitative researchers (which Haladyna et al. are not) can get wrong.
So, only two-thirds of their sources support this rule, though none explicitly oppose it.
Second, yes, this rule is largely redundant with others. The only definition they offer comes from a 1993 empirical study, which reports, “The defining characteristics of trick items, in order of prevalence, included intention, trivial content, too fine answer discrimination, stems with window dressing, multiple correct answers, content presented opposite from instruction, and high ambiguity.” Well, much of that clearly is redundant with other rules. But what does it mean for an item to have “too fine answer discrimination”? And what does it matter if an item presents content opposite from instruction, so long as the item matches the assessment target from the domain model?
Third, and this is the killer: this 1993 study of trick items failed to show that they exist. That is, participants were unable to tell the difference between intentionally prepared trick items and intentionally prepared non-trick items. Respondents, the same group from whom that set of defining characteristics was drawn, “were not able to differentiate between the two.”
Fourth, and perhaps this should be first, we can bury this rule simply by quoting Haladyna et al. (2002, p. 315):
Roberts (1993) concluded that textbook authors do not give trick items much coverage, owing perhaps to the fact that trick items are not well defined and examples of trick items are not commonplace. His conclusion is consistent with results reported in Table 2 showing that 33% of the textbooks failed to mention trick items. Another good point is that the accusation of trick items on a test may be an excuse by students who lack knowledge and do poorly on a test. We continue to support the idea that trick items should never be used in any test and recommend that the basis for teaching about trick items be consistent with Roberts’s findings.
So, teaching should be consistent with Roberts’s findings about trick items, which are that they are rare, don’t really exist as a category, and are kinda just an excuse by test takers who get them wrong? But by all means, let’s be sure to teach that. That is totally a strong basis for an item writing rule, and it will really help everyone to identify high-quality items.
Roberts, D. M. (1993). An empirical study on the nature of trick test questions. Journal of Educational Measurement, 30, 331–344. <https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=52f91caa6a96e045f0c9af5b845e36e4118fd5df>
[Haladyna et al.’s exercise started with a pair of 1989 articles and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable), and it is the only version that includes a well-formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.
Haladyna Lists and Explanations
Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.
Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37–50.
Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51–78.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–333.
]