Fisking the Haladyna Rules #14: Clear directions

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the stem: Ensure that the directions in the stem are very clear.

Yes. Very good rule. No notes.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #13: Minimize reading

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Style concerns: Minimize the amount of reading in each item.

This rule is mentioned by one-third of their 2002 sources. But this rule also seems rather redundant with Rule 16 (Avoid window dressing (excessive verbiage)). Obviously, the irony of repeating a rule about being mindful of unnecessary verbiage is a joke, right?

So, is reading time a problem? Let’s put aside “excessive verbiage,” because that is Rule 16. This is about reading time.

Yes, reading load should be managed. Reading load matters: whether or not students face formal time limits, they run out of stamina. But many standardized test items are based upon reading passages that are included in the test. They have to be included because there is no centralized control over the texts that students read in this country and because we often want the texts to be new to the test takers (i.e., so they must lean exclusively on their own reading skills to understand them). But these passages are always far shorter than the texts we expect these test takers to be able to understand on their own. Certainly, this is true on ELA (English Language Arts) exams, but it is also true on science exams and social studies exams. When mathematics is actually applied in the real world, it is applied in embedded problems and situations, not just as bare arithmetic or algebra. These excerpts and passages are already shorter than what we expect students to be able to handle in authentic circumstances.

Minimizing reading time surely does allow for more items and improve reliability, as their 2013 book says. But minimizing the reading time, as the 2002 article suggests, often simply comes at the expense of the targeted cognition. Sure, if the item is just a basic recall item, keep it short. But if it is a test of reading skills or problem-solving skills (which often call for recognizing extraneous information), minimizing reading time undermines content and construct validity. It looks to a readily quantifiable measure of items (or item sets) and declares that that is more important than the actual purpose or goal of the assessment or the item.

To be fair, their 2004 book does say, “As brief as possible without compromising the content and cognitive demand we require.” But their 2013 book says that minimizing reading time can improve both reliability and validity, which shows a rather poor understanding of validity, in my view. Yes, the 2013 book does acknowledge the occasional importance of extraneous information, but it says nothing about the importance of passage length. And let’s be clear: this rule is not aimed at just the stem, just the answer options or just those two. This is about the whole item—which of course includes the stimulus!

Now, if this rule said, Minimize the amount of reading in each item, without compromising the test content, it would be better. It would almost be good, provided of course that we could all rely on everyone taking that caveat seriously. But the 2002 list—the only version that includes all the rules in a handy one-page format—does not say that. Nowhere in that article does it offer that caveat. None of the other versions offer that critical caveat as part of the rule, though its inclusion would not make this the longest rule. There are at least half a dozen longer rules, including rules made up of multiple sentences.

So, this could be a good rule, but not as it is presented. As presented, it too often suggests undermining validity.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #12: Correct grammar

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Style concerns: Use correct grammar, punctuation, capitalization, and spelling.

I will try to put aside the common mistake of referring to word choice, punctuation, spelling and other conventions of language as “grammar.” Grammar is about syntax. The folks who say they care so much about “grammar” are exactly the folks who get offended by such misuse of language when others do it. This rule ought to be labeled “Language conventions,” not “correct grammar.” But I will put that aside, because although I was an English teacher, I am not of that sort. (I am just bothered by the hypocrisy of these people who look down their noses at the language use of others.)

So, what about evaluating this rule as a rule? Yeah. It’s a good rule. It is something that we ought to all be able to agree on. Heck, I would make it the first rule. It’s not challenging or controversial. It does not really need to be explained.

Buried in the middle of the list? Meh. Not great. Things that we can all agree on should go at the beginning of the list, with more challenging ideas coming later.

But, with my own view of language use—grounded as it is in what I have learned from actual scholarly linguists (i.e., those who study actual grammar and syntax) and the beauty of literature—there really isn’t a “correct grammar.” Rather, there is a preferred style, usually one of formal English, though usually one that falls short of the formality of academic writing. This rule would be much improved if it spoke of “formal grammar” rather than “correct grammar.”

But how would I know? I’m just an English teacher who studied linguistics in college.

(And yeah, this rule is mentioned by barely half of their 2002 sources. Not really a consensus to endorse.)

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #11: Edit and proof

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Style concerns: Edit and proof items.

We’re just gonna ignore that they left it at “proof” instead of “proofread,” right? I shouldn’t be that petty, right?

Because this rule is one of the style concerns, we know that this is not about substantive editing. It is not in the content section, nor in the stem or choices sections. However, the 2002 article does not provide any explanation. Their 2004 book revises this rule to “Proofread each item,” but the 2013 book goes back to “Edit and proof items.”

That 2013 book addresses style guides at length, and this is…somewhat important. Consistency, while sometimes the hobgoblin of little minds, is something that will be looked for by many people. Others within testing organizations and/or sponsors, in addition to teachers and other members of the public, can notice and pick on inconsistency in style. So, yes, items should be compliant with the relevant style guides. Such style guides ought to end discussion and debate on how something should be presented or how a word is spelled (e.g., email, e-mail, Email, E-Mail, emails, etc.).

However, when most people think about proofreading, they are thinking about…spelling, grammar, punctuation and word choice (i.e., often formality of register). Yes, items should be edited for style and these concerns. And yes, proofreading is important. But here’s the thing: this rule is not a description of anything about the items. This is the only rule that is only about test developers’ actions. And we know that this rule is not about “grammar, punctuation, capitalization and spelling” because that is the next rule.

Note that this rule does not actually say anything about style guides in any version of the Haladyna rules. One of the explanations (2013) mentions the importance of style guides, but even that version does not mention them in the rule itself. This is entirely about process, and not about product. This is about all published writing and is in no way particular to items.

It is probably dumb to include this in a set of item writing rules or guidelines, and it certainly is particularly dumb to include it in this kind of list of rules or guidelines. In fact, only one-third of their 2002 sources even mention it because it probably should go without saying!

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #10: Format items vertically

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Formatting concerns: Format the item vertically instead of horizontally.

This one is just dumb. The New York Regents exams violate it all the time. Less than half of Haladyna et al.’s 2002 sources even mention this dumb rule, and nearly one quarter of them explicitly disagree.

In our own research, over half of respondents said that this is irrelevant, though the vast majority of the rest agreed that it is a good thing—though at the lowest level of value (i.e., Useful, as opposed to Important or Very Important).

There certainly is no consensus on this, and Haladyna et al. write, “We have no research evidence to argue that horizontal formatting might affect student performance. Nonetheless, we side with the authors who format their items vertically.” This is not a good basis for including a rule on a list that is supposed to be grounded in the consensus of the literature. It makes clear that this list is little more than a collection of their own opinions masquerading as research findings.

And yet, their 2013 book calls this an “important” (p. 95) item writing guideline. Nowhere do they cite any evidence for this, though they hypothesize that vertical formatting may be less confusing specifically for anxious test takers…without a milligram of support for this contention.

Yeah, “important.” Totally.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #9: Use best item format

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Formatting concerns: Use the question, completion, and best answer versions of the conventional MC, the alternate choice, true-false (TF), multiple true-false (MTF), matching, and the context-dependent item and item set formats, but AVOID the complex MC (Type K) format.

Though Haladyna et al. put this in the short formatting concerns section of their list, this rule is not about formatting. This rule is about the very structure of the item. True-false items are quite different from a classic three- or four-option multiple choice item. A matching item (i.e., matching each of 3-8 choices in column A with the appropriate option(s) in column B) is entirely different from either of the others. This is not merely “formatting”; these are all item types.

Context-dependent items and item sets are not merely a matter of formatting, either. They are items linked to a common stimulus or common topic. But they can each be of any item type.

So, this rule says that it is ok to use different item types? Oh, OK. It is ok to have item sets? Oh, OK.

What is this rule really saying? All that it is really saying is do not use complex MC items. Those are the ones that ask a question, list some possible answers and then give a list of combinations of answers to select from. For example,

Which of these rules are actually decent rules?

I. Keep vocabulary simple for the group of students being tested.

II. Avoid trick items.

III. Minimize the amount of reading in each item.

IV. Place choices in logical or numerical order.

a) I only

b) I and III only

c) II and IV only

d) II, III and IV only

e) I, II, III and IV

Yes, we grew up with this item type. Yes, this item type is needlessly confusing. But the rule should be something like “Replace complex MC (Type K) items with multiple true-false or multiple select items.” Unfortunately, 80% of the rule is about other things, and the part of their rule that starts to get at this is buried at the end. Moreover, the rule itself does not say what to do about this problem, whereas our offered replacement is direct and helpful.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #8: Simple vocabulary

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Keep vocabulary simple for the group of students being tested.

Hey! I like this one. This is good. Not even any claims of unanimity, when a large portion of their sources don’t even mention it.

Now, I wouldn’t quite say it this way. The issue is not really just about vocabulary, as their explanation shows (i.e., they mention “reading demand” and “simplified language”). Syntax, style, sentence length, paragraph construction and even passage length can be quite relevant. So, we could improve this rule, but it’s very much in the right ballpark.

What is truly great about this rule is that it acknowledges that it matters who is being tested. What is appropriate depends on this group of test takers. They cite testing ELL (English Language Learner) students, which is a natural concern. But this matters across all populations, and it also gets at regionalisms and some other dimensions of fairness. We love that this rule acknowledges the existence of diversity and difference among test takers, and that it should shape how we all write, refine and evaluate items.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #7: Avoid trick items

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Avoid trick items.

This one sounds good. It really does. We’ve even caught ourselves objecting to an item because it seems “too tricky…,” but we do stop ourselves at that moment because we know that that doesn’t mean anything.

What on earth is a “trick item”? If it is actually aimed at the targeted cognition, if it elicits evidence of the targeted cognition, if it follows the other content rules, what does this rule add? It certainly would help if Haladyna and his colleagues defended and explained it.

It turns out that they do try to explain this one in 2002, but it does not go well. First, they claim that this rule is supported unanimously by their sources, which is true if you ignore the fully one-third that do not mention it at all. As a qualitative researcher, I must say that it is highly misleading to claim that sources unanimously supported a contention without qualifying that claim by pointing out how many abstentions there were. Perhaps this is because this is actually qualitative research and Haladyna et al. simply do not understand appropriate use of quasi-statistics when reporting on qualitative research (Maxwell, 2012), something that even full-time qualitative researchers—which Haladyna et al. are not—can get wrong. 

So, only two-thirds of their sources support this rule, though none explicitly oppose it.

Second, yes, this rule is largely redundant with others. The only definition they offer comes from a 1993 empirical study, which offers, “The defining characteristics of trick items, in order of prevalence, included intention, trivial content, too fine answer discrimination, stems with window dressing, multiple correct answers, content presented opposite from instruction, and high ambiguity.” Well, much of that clearly is redundant with other rules. But what does it mean for an item to have “too fine answer discrimination”? What does it matter if the item is opposite instruction if the item matches the assessment target from the domain model?

Third, and this is killer, this 1993 study of trick items failed to show that they exist. That is, participants were unable to tell the difference between intentionally prepared trick items and intentionally prepared non-trick items. Respondents, the same group from whom that set of defining characteristics was drawn, “were not able to differentiate between the two.”

Fourth, and perhaps this should be first, we can bury this rule simply by quoting Haladyna et al. (2002, p. 315).

Roberts (1993) concluded that textbook authors do not give trick items much coverage, owing perhaps to the fact that trick items are not well defined and examples of trick items are not commonplace. His conclusion is consistent with results reported in Table 2 showing that 33% of the textbooks failed to mention trick items. Another good point is that the accusation of trick items on a test may be an excuse by students who lack knowledge and do poorly on a test. We continue to support the idea that trick items should never be used in any test and recommend that the basis for teaching about trick items be consistent with Roberts’s findings.

So, teaching should be consistent with Roberts’ finding about trick items, which is that they are rare, don’t really exist as a category and are kinda just an excuse by test takers who get them wrong? But by all means, let’s be sure to teach that. That is totally a strong basis for an item writing rule, and will really help everyone to identify high quality items.

Roberts, D. M. (1993). An empirical study on the nature of trick test questions. Journal of Educational Measurement, 30, 331–344. < https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=52f91caa6a96e045f0c9af5b845e36e4118fd5df>

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #6: Avoid opinions

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Avoid opinion-based items.

Just about a quarter of their 2002 sources even mentioned this rule, and while a higher share of their 1989 sources mention it, over a quarter of those sources argue against it. The 1989 article sets a 70% threshold (of those who bother to mention the rule) to establish a consensus, and seems tailored to this rule. Of course, in their opinion that’s a perfectly fine way to claim a consensus, when 70% of those who bother to mention it don’t argue against it.

But this rule seems to go against the idea from other rules that we should assess critical reasoning. Heck, Haladyna has a book on assessing higher order thinking. In fact, so much of higher order thinking in the humanities and social science is about evaluating opinion. Expert judgment is a form of opinion, something that experts usually—but not always—agree on because of their common experiences and understanding. Surely, opinion should be grounded, and I taught my students to explain and defend their opinions. Opinions are important.

Unfortunately, the 2002 article does not explain what they mean. But in their books, they offer the two questions, “What is the best comedy film ever made?” and “According to the Film Institute of New York, what is the greatest American comedy film ever made?” They reject the former question as an “unqualified opinion” (i.e., not supported by “documented source, evidence or presentation cited in a curriculum”), which they think is bad. They accept the latter as a “qualified opinion,” which they think is good. (Of course, ignoring the fact that there is no Film Institute of New York.) If the problem is “unqualified” opinions, their rule should say that, but it doesn’t.

I believe that this shows that this is a ridiculous rule. Some opinion-based items are fine, even for multiple choice items. But some are not. Obviously, simply asking the test taker to identify the item developer’s opinion is hugely problematic, but the problem there is that the item is asking test takers to read the item developer’s mind, not that it is asking about an opinion. Asking the test taker to identify the opinion of a character in, or the author of, a passage is fine—even highly appropriate. The problem has nothing to do with the fact that these items are about opinions; the problem is asking test takers to read item developers’ minds.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #5: Avoid over specific/general

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Avoid over specific and over general content when writing MC items.

Their fifth rule has two fundamental problems. First, it is supported by just 15% of their 2002 sources and not even that many of their 1989 sources. Furthermore, fully 25% of their 1989 sources argue against this rule. But that general lack of support is not the biggest problem.

The biggest problem is that this rule does not mean anything. “Do not do something too much” is all they are saying. Sure, some people say you can never be too rich or too thin, but other than that sort of formulation, saying “too” (i.e., as in too much) or “over” (i.e., as in too much) makes the rule a tautology. That is, it is a logical circle. Yes, it would be bad for the item to be too hard. Don’t do that. It would be bad for the item to be too specific. Don’t do that. It would be bad for the font to be too small, too big, too baroque. Again, that’s just what “too” or “over” mean!

So, the real question is what it means to be over specific or over general. They seem to be saying that such things exist, but they provide no guidance whatsoever for what they mean. They are providing an objection, but no basis for when it applies—or even what it really means. Nothing. Just nothing.

I do not think that I hate any of Haladyna et al.’s rules more than this one. It epitomizes the problem with their whole approach. Perhaps there is value in their original articles as literature review. Perhaps. But they have brought these lists forward into handbooks and their own books. Others cite them and quote them all the time. They called their list “a complete and authoritative set of guidelines for writing multiple-choice items.” Is it really? Does this example from their “Guidelines/Rules/Suggestions/Advice” (as they called them in 1989) actually help anyone to write or evaluate items?

Obviously not. I do not think it could be any more obvious.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #4: Keep items independent

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Keep the content of each item independent from content of other items on the test.

There is no explanation of this rule anywhere in either Haladyna and his colleagues’ 1989 article or 2002 article. Moreover, less than half of their sources for either article mention this rule. Their companion 1989 article does not list a single empirical study on this rule. And yet, they still call this a consensus.

One must turn to their 2004 or 2013 books to find an explanation. No, the explanation is not convincing. They offer a pair of items as a counter-example to demonstrate what is wrong. Note, this is THEIR example. They change the names, but I will use the more recent ones (2013).

Who was Kay’s best friend?

a. Wendy

b. Betty

c. *Tilda

Who was quarreling with Tilda?

a. Kay

b. Wendy

c. Betty

They claim this shows what is wrong because if the test taker knows that Tilda is the correct response to the first item then they will know that Kay cannot be the correct answer to the second item. Because…stories are never about good friends quarrelling? Not at all. Of course not. Where would the drama or story arc be in that?

Frankly, this rule makes it more difficult to ask anything but the most trivial questions about literary passages because the themes and characters run through them. No, those items are not independent in topic; after all, they are taken from the same story. This is almost as great a problem when using informational passages.

Now, there is an issue with the independence of items, one on which one of my closest colleagues and I disagree. As a science educator, she wants items that scaffold up to larger or deeper understanding. She thinks of an item set as a single larger item with various components, even when it technically is not. She wants later items to build upon the answers of earlier items—and even wants the structure of the item set to help test takers to do that. I really do appreciate what she is trying to do, and as a classroom teacher I might do the same thing. But we are both in large-scale standardized assessment now. We are trying to find that optimal (or least bad) balance that allows test takers an opportunity at every point on an assessment, assesses test takers across the range of an NGSS standard (i.e., performance expectation), and yet does not provide so much scaffolding that we cannot be sure whether the test taker actually has the inferred proficiency.

How independent should items be of each other? Not nearly so much as Haladyna et al. claim. And their example is laughable.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #3: Use novel material

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Use novel material to test higher level learning. Paraphrase textbook language or language used during instruction when used in a test item to avoid testing for simple recall.

There is something in this rule that I like, but it really falls off. If there is a need for a second sentence, it certainly should not be that one.

Novelty in items is incredibly important, but it should not be limited to testing “higher level learning.” Honestly, what is higher level learning? What are they talking about? Later in the article they say problem solving and critical thinking, but that leaves us wondering why they didn’t just say that in the rule.

Also in the text, they write of this particular rule as an example of a rule without an empirical basis, but one that they advocate anyway. That might be fine, but check out their reasoning.

We might argue that many of these unresearched guidelines were common sense that did not justify research. For example, consider the guideline: “Use novel material to test higher level learning.” Most educators place a great value on teaching and testing higher level thinking. Thus, this item-writing guideline may be a consensus value among testing specialists without the need for research.

They offer that because educators agree on the goal, this is the right method. They do not even try to explain why or how novelty is so important for “higher level learning.” As an English teacher, I would read that lack of explanation as a lack of investment in thinking, which is so often the case with this list.

In fact, novelty is important for a wide range of cognition, and the issue is not solved simply by paraphrasing. If the example in an item is the same as one used in a teacher’s lecture or some class activity, the test taker might simply recall the answer given to them by their teacher—rather than generate it themself as the test assumes they would. If a reading passage is taken from a work read for class, they might recall their teacher’s (or fellow students’) explanation or analysis, rather than generate their own. For example, if you want to test whether they know about the dynamics of the Romeo & Juliet balcony scene, by all means use that excerpt. But if you want to know whether they can read Shakespeare’s language and understand it, you need to present something that test takers have not already had explained to them. Note that this is not simply about critical thinking, as it could simply be about understanding the plain language.

In fact, items need to be sufficiently novel that test takers cannot simply rely on their memory of how the example was already explained to them, but not so novel that they require significant new learning to make sense of them. That can be a careful balance, and it is made all the more difficult because our formal curriculum can vary so much from district to district, and even where the same formal curriculum exists, the enacted curriculum and lesson plans can vary enormously. It takes real knowledge of how content is generally taught to find the appropriate level of novelty.

Taking derivatives (i.e., differential calculus) likely counts as problem-solving in Haladyna et al.’s view. But the simple items should not ask about x-squared (x^2). That simply is not novel enough. Asking about x^8 or x^197 is no more difficult for someone who understands, and yet the answer is not going to be simply recalled. However, I think that such a task does not rise to the level of critical thinking or problem solving. It is clearly what Webb’s Depth of Knowledge (DOK) and our own revised Depth of Knowledge (rDOK) would classify at the lowest level of cognitive complexity. And yet, that same need for novelty is just as necessary.
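To make that concrete, here is a quick worked illustration of the arithmetic (mine, not one from Haladyna et al.): the power rule treats all of these identically, even though only the first is likely to have been rehearsed in class.

\[
\frac{d}{dx}\,x^{n} = n\,x^{n-1}
\quad\Rightarrow\quad
\frac{d}{dx}\,x^{2} = 2x, \qquad
\frac{d}{dx}\,x^{8} = 8x^{7}, \qquad
\frac{d}{dx}\,x^{197} = 197\,x^{196}.
\]

A test taker who merely memorized the classroom example can answer the first; answering the last takes the same single step of understanding, and nothing more.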

Yes, novelty is a very important idea in item development. But no, this rule does not get at it accurately. It is affirmatively damaging because it suggests that novelty is not important outside of “higher level learning” and that paraphrasing is a sufficient mitigation.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #2: Important, not trivial content

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Base each item on important content to learn; avoid trivial content.

Oh, we hate this one. We think that this rule undermines any concept of item or test alignment, even though 42 of their 54 sources apparently support it.

It simply is not for item developers to decide what to test. Standards or other forms of domain modelling lay out what should be tested. It is the job of item developers to figure out how to assess that content, not whether to assess that target.

Furthermore, this rule rather falls towards a tautology. Obviously, given limited testing time—as is invariably the case—the time should be spent well. Yeah, only test things worth testing. Perhaps this rule does not quite reach the level of tautology, but it does not get beyond being too obvious to need saying. As is, either it is too obvious, or it begs the question. That is, what counts as trivial? What counts as important? Do they give any guidance on that?

We would prefer it to be phrased as, Don’t waste test takers’ time. But should that really need to be said?

Now, there is a related point that they are not making here, but it is very, very important. That is, when creating an item aligned to some assessment target, aim for the core of that target. Aim for the most important part of the standard, the part that is most useful or most likely to be built upon later. Do not simply aim for the easiest part of the target. Yes, that would be easier. But it would not help test validity in any way, and would not help test takers or other users of tests.

But that is not what this rule is about.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #1: Single content and behavior

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Every item should reflect specific content and a single specific mental behavior, as called for in test specifications (two-way grid, test blueprint).

We actually like this rule. This is a good start for the Haladyna list. No, this 2002 rule has no real antecedent in the 1989 list, but that is to the 2002 list’s credit. And this rule is supported by every source that mentions it, though those sources are not quite three quarters of the total.

The problem we have with this rule is that Haladyna and his colleagues never explain it anywhere. We have found that, in practice, the meaning of and reasoning behind this rule is often unknown. Frankly, we cannot be sure that Haladyna et al. even mean what we would mean by this rule, and that’s a real problem.

We believe that it is important that each item be aligned to one specific assessment target. That targeted cognition should come from a domain model. Quite often, this is a standard from a set of state learning standards, or it could be some element from a job or role analysis. We believe in domain modelling and domain analysis; we love ECD (i.e., evidence centered design). (We recognize that the good work done by ECD to highlight the importance of domain models came after 2002, so we forgive Haladyna et al. for thinking that assessment targets just come out of test specifications.) 

We know that it is important that items each target just one thing because otherwise there would be no way to determine why a test taker responded to the item unsuccessfully. They could just be making the same mistake over and over again, each time one standard is part of an item, even though they have mastery over all the rest. We should not be basing inferences about the successful learning, teaching or coverage of a standard (i.e., when evaluating students, teachers or curriculum, respectively) on such ambiguous sorts of evidence.

Just as importantly…well, actually more importantly, each item should actually depend appropriately on that targeted cognition. There should not be alternative paths to a successful response available to test takers. They should have to make use of the targeted cognition to get to a successful response, and that targeted cognition should be the key step (i.e., the thing they mess up, if they are going to mess anything up). Otherwise, items can yield false-positive or false-negative evidence.

Is all of that clear in how Haladyna et al. phrased their rule? Is it made clear elsewhere in their article? Is it made more clear elsewhere in all their writings? Not really.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Sometimes, You Lose

This week, my local school board voted on a final appeal by a very small group of people in my district to ban five books. The school board used these special working group meetings to act more like a book group than a policy and oversight board, which troubled me. But at the end of the discussion of each book, they voted not to ban it from the classroom or the libraries of the town.

(Toni Morrison will not be banned, here. Books about coming out as gay will not be banned, here. Sex education reference books will not be banned, here. Books that display honest and truthful depictions of the teen experience will not be banned, here.)

Yes, that is a victory. But it is a battle that should not have to be fought.

You see, we had a school board election a couple of years ago in which the candidates newly nominated by the recently radicalized/trumpified local Republican committee—which included none of the Republican incumbents—all lost, and lost badly. They lost even though, by local regulation, no single party can hold more than a bare majority of seats on the school board.

Which means that the Trump-Republican candidates got the GOP line on the ballot and the GOP incumbents had to run as independents. And the crazies still lost 2-to-1. I do not mean that they only got 1/3 of the school board; I mean that their candidates only got 1/3 of the vote. And I’d bet serious money that most of that was simply habitual supporters of GOP candidates, as opposed to voters who understood what they were running on.

These losers simply do not accept that they have lost. The community heard what they have to say about these values that they wish to rule by, and the community turned them down decisively. The community does not believe that educators are groomers or that our public schools are a threat to children or families.

I wish that this kind of refusal to accept that they have not convinced anyone were limited to this district. Unfortunately, we see the losers on a bigger stage than this. Today, the Supreme Court agreed to hear a case in which these same sorts of people want private businesses who do not want to be associated with their garbage to be forced to publish it, bearing whatever it costs them. Freedom of association and freedom of expression be damned!

As Twitter (now X) makes more room for this garbage on its own platform, advertisers are running away. The market speaks.

But these losers—in elections and in the market—simply refuse to listen. They want to force their ideology on others, abandoning any respect for democracy or markets.

It is bad enough that their ideas are odious, grounded in fantasies and other untruths, based on a need to hate and demean others, on theocracy and quite often on naked racism. No, that’s not enough. They also refuse to accept when they have lost or to adjust their goals—or even their tactics—to that reality.

I do not know how to live in a society with people like that. I just don’t.