What is ECD's View of Evidence?

However, we’ve come to realize that ECD actually comes up short when it comes to evidence. I do not mean that it lacks procedures there—after all, it lacks procedures everywhere. Rather, ECD has far too little to say about evidence itself.

How do we know what counts as good evidence? How do we recognize evidence? How do we avoid bad evidence?

ECD’s framework includes Evidence Statements (i.e., descriptions of what evidence of the claims might look like in action). As with Task Models and Domain Models, ECD does not explain what Evidence Statements should look like. That is left to individual practitioners and/or project teams. But we’ve come to realize that there are some essential problems with this overly vague view of Evidence Statements.

Simple Evidence Statements have a number of significant weaknesses. This is why RTD suggests more robust evidence statements, created in the context of strong Item Logic.

First, simple evidence statements often mistake an absence of evidence for evidence of absence. Whether the purpose of an assessment is summative or formative, identifying what students do not know and cannot do is at least as important as identifying what they can do. In fact, with formative assessment, it is even more important. This is not about being negative; rather, it is about being instructionally minded. It is important to be sure one does not mistake an absence of affirming evidence for the presence of disconfirming evidence.

Second, simple evidence statements are prone to Type II errors (false negatives). This is in part due to the absence of evidence problem, but it is also due to their inability to disentangle different causes for mistakes or errors.

Third, simple evidence statements are prone to Type I errors (false positives). A characteristic of a student’s or test taker’s work could be present because of the targeted knowledge, skill, and/or ability (KSA), but it might instead be present because the test taker took an alternative path that did not depend on that KSA.
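To make those two error types concrete, here is a minimal sketch in Python, using entirely invented data, of how the judgments produced by a simple evidence statement can be cross-tabulated against what test takers can actually do. The off-diagonal cells are the Type II and Type I errors described above.

```python
# Minimal sketch with invented data: each pair is (truly proficient?, evidence statement satisfied?).
observations = [
    (True, True),    # proficient, and the evidence appeared
    (True, False),   # proficient, but took a path the statement did not anticipate
    (False, True),   # not proficient, but surface features satisfied the statement
    (False, False),  # not proficient, and no evidence appeared
    (True, True),
    (False, True),
]

counts = {"true positive": 0, "false negative (Type II)": 0,
          "false positive (Type I)": 0, "true negative": 0}
for proficient, evidence in observations:
    if proficient and evidence:
        counts["true positive"] += 1
    elif proficient:
        counts["false negative (Type II)"] += 1   # absence of evidence read as evidence of absence
    elif evidence:
        counts["false positive (Type I)"] += 1    # the statement was satisfied without the targeted KSA
    else:
        counts["true negative"] += 1

print(counts)
```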

Simple evidence statements likely work best in the context of some sort of portfolio assessment, in which raters are able to review a broader set of each student's or test taker’s work and look for larger trends and patterns. Taken together, the errors in that noisy data can cancel out and the signal can become apparent. This is really just a sample size issue; the noisier the data, the larger the sample needed.
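That sample size point can be illustrated with a quick simulation. This is only a sketch with made-up numbers: each observation is a noisy reading of a single invented “true” proficiency, and the question is how many observations it takes, at different noise levels, for the average to settle near the truth.

```python
import random

random.seed(1)

def average_of_noisy_readings(true_score, noise_sd, n):
    """Average n noisy observations of one (invented) true score."""
    return sum(random.gauss(true_score, noise_sd) for _ in range(n)) / n

true_score = 0.70                      # hypothetical proficiency on a 0-1 scale
for noise_sd in (0.05, 0.25):          # cleaner evidence vs. noisier evidence
    for n in (1, 5, 25, 100):          # how much of the student's work gets reviewed
        estimate = average_of_noisy_readings(true_score, noise_sd, n)
        print(f"noise={noise_sd:.2f}  n={n:3d}  estimate={estimate:.3f}")
```

With the noisier readings, small samples tend to wander well away from 0.70, while the larger samples settle close to it—which is the whole argument for portfolios, and the whole problem for single items.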

However, neither formative assessment nor large scale standardized assessment has access to such large samples of a student's or test taker's work for each assessment target. Therefore, robust evidence statements are needed.

Robust evidence statements must include information about the context in which the evidence appears. What sort of directions or instructions prompted the work? Did they specifically ask for this sort of evidence, or did they merely provide an opportunity to produce it? How much scaffolding was present? Did the task allow for alternative paths? ECD talks about the evidentiary argument, and the importance of the source of and context for evidence is well known in the law. Assessment should take those issues just as seriously.

The Value of Negative Information

One of the most famous Sherlock Holmes stories is Silver Blaze, the one in which he deduces the identity of the guilty party because a dog did not bark. Negative information is often worth as much as positive information.

There certainly have been times in my life, however, when I felt that those who judged me were more concerned with negative information than with positive information. They were more concerned with what I did not do than with what I did do. Or they were more concerned with what I could not do than with what I could do. As a teacher, I certainly wanted to identify and celebrate what my students could do.

Positive information feels positive and perhaps celebratory. Negative information feels negative and perhaps even mean.

But negative information can be invaluable.

In some places, there is an explicit effort to focus on what students can do, rather than what they cannot do. This even gets to how some people and organizations talk about assessment. Rather than saying, “Assessments should identify what students can and cannot do,” they say, “Assessments should identify what students know and can do.” I really do understand this urge. I used to be a teacher and I cared deeply about my students.

And yet, educators often say that what they really need is high quality formative assessments. That is, they need assessments that help them to provide instruction and to identify where students need more instruction. Of course, formative use of assessment requires time to go back and provide additional instruction and support to students after the assessment (and therefore timely results). But it also requires negative information. It requires tests and items that highlight what students do not know and what they cannot do.

This means that items used in formative assessment must be incredibly careful about false positive results (Type I errors). They cannot provide alternative paths to a successful response that avoid use of the targeted cognition. They cannot be so unstructured that it is not even clear what KSAs a test taker used to produce a successful response. Nor can they focus on the integration of skills such that it is not clear which KSAs broke down when test takers failed to produce a successful response.

The kinds of activities that I would rather teach with and want my students to be able to succeed with are not likely to be very useful for identifying what they need further help with. Yes, many students’ shortfalls might be clear, but many students will be able to steer around their weaknesses and lean further on their strengths. This kind of compensatory approach works to actively hide the information that formative assessments are intended to uncover.

The backlash against standardized tests is based upon many ideas, but one of them is surely that standardized tests and their results can feel mean. Such tests are often designed to reveal shortcomings, deficits and lack of proficiency. Alternatives to traditional standardized tests, therefore, often focus on the kinds of activities that I would rather teach my students with and with which I want my students to be successful. Such tasks seem to have more potential to feel celebratory. But such tests simply cannot provide high quality information for formative purposes (and the information is not really valid for summative purposes either).

Formative assessment is just as demanding as summative assessment. It requires just as much skill and rigor to produce. Though we do not focus on formative assessment in our Rigorous Test Development (RTD) model of practice, just about everything in RTD applies to formative assessment as well as it does to summative assessment.

How and Why Plagiarism Matters in the Academy

Plagiarism is a very important issue in academia—far more important than in other contexts. This is a very different issue than copyright, which is about the law and perhaps money. Plagiarism is something else. 

Plagiarism is about using the ideas or the expression of ideas of someone else without crediting them for it. (I was taught long ago that it also includes the organization of ideas, but I have never seen that really developed.) It is not a matter of using someone else’s ideas or words; rather, it is using them in an uncredited fashion. The exact same behavior—even the same exact case—can be meaningless and harmless in other contexts, but a major violation in a scholarly context.

There are two reasons for this. 

First, academia is about what I call the scholarly conversation. This is where we build on the work of others, crediting them for their contributions and then extending, applying or refuting them. Because we are building on the work of others, they have already shown how, why, where and in what circumstances those ideas apply, what they put together to get there, and perhaps laid out some caveats and/or restrictions. We do not replicate all of that work ourselves, because they have already done it. Unless our specific goal is to replicate their work—in order to verify it, perhaps in a new context—we should not try to replicate it. Instead, we give our readers a shortcut, by letting them know where they can find that earlier work.

This allows readers to evaluate the validity of the foundations we are building upon by considering the credibility of those earlier scholars. There are people whom I respect so much that I would likely just accept their conclusions, without needing to go and investigate how they got there. There are people whose work I have previously found so problematic that I do not trust anything built upon it. But most scholars? Well, if I am unsure about the meaning or breadth or application of an idea, I might want to go and learn more about it. Citations to others’ work allow me, the reader, to evaluate the precursors to the work I am reading—and to do so in the fashion I choose. 

Second, citing those who came before allows me to evaluate the scholar and work I am reading. If I can see that they know that they are building on the work of these previous scholars, I can be better assured that they have already considered or taken into account the issues that those previous scholars raised. I can see that they are, in fact, building on those other people’s work. This means that they should not be making mistakes already warned against, retreading infertile ground, or simply doing more elementary work. If they show me that they already know that previous work, and how they are building on it, I can take them and their work more seriously.

More subtly, by citing previous scholars, I can usually see the disciplinary, methodological and substantive direction that a work is coming from. It helps me to understand the kinds of concerns that will be explored, the kinds of tools that might be brought to bear and the classes of themes that might be recognized. That is, it gives me notice of what schema I should be activating so that I can more easily make sense of what I am reading and what I will get to in this work.

Now, both of these two reasons matter enormously for those with the expertise to recognize and understand the citations in a work. They can seem like minor things to those who couldn’t make use of the citations for these sorts of purposes. But academic writing is aimed at just such an audience of experts. Obviously, this serves as a barrier to the larger public understanding academic work. This is why writing for the broader public is just so very different from writing scholarly works. But that is a different audience, and different audiences should be approached differently.

Clearly, neither of these two reasons really addresses the importance of correctly indicating when someone else’s words are being used. That is mostly just about politeness. But there is value in clear and/or efficient expression of an idea. We ought to give credit, rather than steal credit, for well crafted explanations of ideas. But in student work, including doctoral dissertations, there is another very important reason to be a stickler for properly crediting the expression of ideas. You see, explaining something in your own words is often how you show that you actually understand what something means, or why it is important. This is why, when quoting an extended passage—even the best written extended passage—it is still important to explain its significance. Yes, this helps the reader to pick out the parts you mean to build on, but (perhaps more importantly) it shows the reader that you actually understand what you are referring to.

Because scholars in the academy therefore must be stringent on this issue with their students, it becomes incumbent upon them to model the behavior they expect from students in their own work. Even a mild paraphrase can be introduced with “As [scholar] explained,…”  And, honestly, I would feel taken advantage of if someone took credit for my phrasings (which I am sometimes proud of). I am so accustomed to giving credit to others in the scholarly community, I expect others to do the same with me.

With all of these reasons to correctly indicate the sources of the words in a piece and the sources of the ideas that the piece builds upon, why not give proper credit and correctly indicate quotations? I can hardly think of a respectable reason, leaving just laziness and sloppiness—which are hardly decent excuses.

However, I would add that non-experts might not recognize when an expression is really just a standard way to explain an idea. In fact, most of my quantitative methodology classes focused a shocking amount of attention on how to explain in words what quantitative data, results and/or analysis signified. This was taken so seriously that if two people in the same class were given the same data, graph or statistical output, we could very easily independently write the exact same sentences to describe them, and our peers (and other experts) would immediately recognize what is—if not essentially a boilerplate sort of language—the style that a particular group has been intentionally acculturated into using. I wish that my qualitative methodology courses were as careful about steering us clear of overstating or misstating what our data showed.

I would also add that this blog is not written in a scholarly fashion or for a scholarly audience. While I sometimes write with lots of citations, I do that much less here. Different form for a different audience, with different expectations. However, I try hard to attribute quotations properly, even here.

Cognitive Complexity: Uncertainty and Deliberation

While cognitive complexity can describe many things, the RTD approach to cognitive complexity is firmly grounded in the assessment industry’s dominant model, Norman Webb’s Depth of Knowledge (wDOK). As we read it, the central thrust of wDOK is the continuum of deliberation-to-automaticity, with the greater cognitive work of more deliberative cognitive paths being more cognitively complex, and the lesser work of greater automaticity—often earned through practice and greater proficiency—being less cognitively complex. (No, this is not the only way to think about cognitive complexity, but we based our rDOK approach on wDOK because it is so dominant in the industry. See our writing on rDOK (revised Depth of Knowledge) to examine how we think this plays out in the various content areas.)

One of our colleagues, a former science educator and now science assessment expert, wisely asked about the relationship between uncertainty and deliberation. Well, there are many kinds of uncertainty, and not all of them are tied to the kind of deliberation that DOK is about. Nonetheless, uncertainty often does lead to greater deliberation and a more cognitively complex path.

  • There is the uncertainty of not even knowing where to start, or whether to start. That is not deliberation. That is just indecision—often paralyzing indecision. It is a general, and common, nervousness that can be a barrier to focused effort. Teachers and tutors are familiar with this and an important part of their role is to help their students to develop the confidence to overcome this kind of uncertainty and take that first step.

  • There is the uncertainty of lack of confidence in one’s execution, which can be entirely rational. Perhaps more people should have this, as it leads to various sorts of proofreading. That is, they review their work for little mistakes in execution, even though this does not include rethinking the whole approach they took. Math teachers say “Check your work,” meaning the mathematics equivalent of proofreading. This uncertainty is not advanced deliberation, and the greater work it prompts is not indicative of great cognitive complexity. Rather, it is essentially repetition of earlier work.

  • There is the uncertainty of not being sure what to do next when in the middle of the problem, or even not being sure what to do first. That is, once past the paralysis that keeps one from even being able to truly try to make sense of the task, one might still be unsure about the first step. This question of “What do I do next?” is a form of deliberation. It can be answered simply by trying to remember the next step in a (perhaps poorly) memorized procedure. It might instead be answered by trying to (re)discover or (re)invent a good next step. This latter response constitutes reasoning and the kind of deliberation at the focus of both wDOK and rDOK. Indeed, uncertainty is often what creates the opportunity for deliberation. 

  • An even more careful deliberation can be prompted by initial uncertainty. One might try to figure out more than just the initial step, instead trying to work out a longer plan before diving into the work of the first step. This is not necessarily a different kind of uncertainty than mentioned above, but one’s response to it can be less or more careful and deliberative—and therefore more cognitively complex.

  • There is also a second kind of uncertainty after completing a task. One might ask oneself, “Was that even the right thing to do?” and revisit/question the reasoning that led to the steps taken. This differs from merely proofreading/checking one’s work, though both are prompted by uncertainty after the fact. Proofreading revisits execution, whereas this revisiting of reasoning is more cognitively complex.

Of course, one might be uncertain before a task and i) carefully develop a plan to help break through initial paralysis, ii) execute the plan, iii) revisit the reasoning of the plan but decide it was a good approach, and iv) then check one’s work. Uncertainty can drive all of this. All of that careful deliberation can still lead to bad plans poorly executed, with errors that were missed when proofreading. No amount of deliberation can guarantee success, and the highly proficient can often achieve success without any conscious deliberation.

Uncertainty can be a product of a range of factors. It might come from genuine ignorance or other lack of necessary skills. It can come from insufficient practice or arise due to being faced with novel situations. It can be a product of poor instruction or of a lack of effort to learn by a student. Some people are by character more confident and some are by character less confident—and either may be justified or not in this. But regardless of the source of uncertainty, the question of cognitive complexity (i.e., either rDOK or wDOK) is answered by looking at the response to uncertainty.

On the other hand, an excess of certainty obviously inhibits deliberation. It makes deliberation of any sort far less likely, which is often detrimental to producing high quality work. Ideally, intellectual humility would put an upper limit on certainty and a lower limit on deliberation of various sorts.

Writing Multiple Choice Items is Harder than it Looks

Writing multiple choice (MC) items is far more difficult than simply writing good constructed response (CR) questions. CR questions give the test taker the freedom to take any path they want and to demonstrate their understandings and misunderstandings without the scaffolding and limitations of the set of answer options that are the distinguishing characteristic of MC items.

When we use MC items, it is because, for all the initial difficulty in writing and refining them, they are faster and cheaper to score than CR items. Machines have been able to do it quickly and for virtually no cost for decades. That’s why we used to use #2 pencils—that’s the kind of pencil that the machines were made to read most easily. That long term cost saving is truly the only reason to use them.

But what does it take to write an MC item that gathers the information that a CR item does? How do we avoid making the answer too obvious? How do we avoid the little hints that so often appear in MC items? Unfortunately, very few researchers have really looked at the contents of items closely enough to offer good guidance, but we do know that each distractor (i.e., incorrect answer option) should represent the result of a misunderstanding or misapplication of whatever knowledge, skills and/or abilities the item is trying to assess.

So, let us consider a very simple question: How many states make up the United States of America?

We know that the key (i.e., the correct answer) is 50. But can we come up with three or four good distractors?

It seems plausible that some significant share of test takers who get this question wrong will come up with 13, confusing the number of states with the number of original colonies that rebelled against the British and formed the original collection of states. 13 is clearly a good distractor.

What are the other mistakes that someone might make? Can you think of any? I don’t think that any other key American government numbers are likely mistakes. 

  • 3 (i.e., branches of government) is too obviously wrong and too low. 

  • 538 (i.e., the number of members in the Electoral College) is too obscure and too advanced. Anyone who even knows that that is a significant number clearly will know the number of states.

  • 435 (i.e., the number of members of the House of Representatives) is not quite so advanced as 538, but suffers similar problems.

  • 27 (i.e., the number of constitutional amendments) again suffers similar problems. 

So, we still need two or three more distractors. What about sheer guessing? I expect that most people who would get this question wrong would simply not know the answer and would guess. What are likely guesses?

I think that 100 is a likely guess, and perhaps attractive to someone who doesn’t know the real answer. I’m concerned that it’s a little large, but a nice round number doesn’t seem crazy. 

But we still need at least one more distractor. Something smaller than 100—which feels a little large. Maybe not a round number, so it feels more precise. Mexico has 31 states, so that might be particularly attractive to Mexican migrants. Of course, that might raise a fairness issue. Maybe they’d be more likely to pick it because they recognize the number, or maybe less likely because they know it is only Mexico’s answer? I’m going to ignore that for now.

How many states make up the United States of America?

A. 13

B. 31

C. 50

D. 100

Now, I don’t love that. I’m always nervous that test takers are more likely to pick middle values (i.e., under the goldilocks principle of juuuuuuust right). Obviously, though, we ought to use every answer position equally when placing the key. I suspect that item developers too often try to offer distractors that are smaller and that are greater than the key, and I don’t want to fall into a too common pattern. We could replace 100 with 25, though that is not so round a number, and therefore perhaps not as attractive for guessing.

How many states make up the United States of America?

A. 13

B. 31

C. 25

D. 50

Such a simple question, and with one clearly correct answer. And yet, it’s not obvious which set of distractors is better. We would love to offer good distractors that attract as many guessers and mistaken test takers as possible, so that the item identifies only the test takers who really do know the correct answer.

Writing high quality multiple choice items is hard. 
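One narrow, mechanical slice of that difficulty can at least be checked automatically: the concern above about placing the key in each answer position about equally often across a bank of items. A minimal sketch, using an entirely made-up bank with made-up item IDs:

```python
from collections import Counter

# Hypothetical item bank: each invented item ID maps to the position of its key.
item_bank = {
    "US-STATES-001": "C",
    "US-STATES-002": "D",
    "GOV-BRANCHES-001": "A",
    "GOV-AMENDMENTS-001": "B",
    "GOV-HOUSE-001": "C",
    "GOV-SENATE-001": "C",
}

key_positions = Counter(item_bank.values())
n_items = len(item_bank)
for position in "ABCD":
    share = key_positions.get(position, 0) / n_items
    flag = "  <- noticeably off an even split" if abs(share - 0.25) > 0.10 else ""
    print(f"key in position {position}: {share:.0%}{flag}")
```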


Rise in Absenteeism, Drop in Exams

Two related stories caught my eye this week.

First, the New York Times reports that the State of New York might be dropping its quite longstanding end of course (EOC) Regents Exams as a graduation requirement. I first heard of these tests back in the mid-80s, though I then lived in the DC suburbs. I later taught in New York, and became quite familiar with them. This story also mentions reducing the multiple different sorts of high school diplomas available in New York down to a single diploma, which is related to the use of the Regents Exams, as passing additional Regents Exams (e.g., more science?) was a major requirement for the higher level diplomas.

Second, there has been a steep increase in student absences since before the pandemic. I saw a chart from a DC school district report that showed how bad it’s gotten there, with nearly 50% of students missing at least 10% of school.

That is worse at the high school level, with nearly 2/3 of students missing at least 10% of school days and over 25% missing at least 30% of school days. Anyone who has spent days digging out of the hole created by missing a couple of days for vacation, due to sickness or just because of a work trip easily understands that the impact of missing a day or two of class stretches far beyond simply the days missed.

A White House Council of Economic Advisors’ blog post pointed to a larger national study of the pandemic’s impact on absenteeism. It shows nearly a doubling of chronic absenteeism across the country. 

Of course, absenteeism is not evenly distributed across all schools. This is a greater problem in lower SES schools, and often in more minority and English Language Learner dominated schools. This is why DC’s numbers look particularly bad, as they do not include schools from the (more affluent) suburbs.

These stories feel connected to me. I know from my own teaching experience—both in suburban schools and in inner-city schools—that the kids who failed to graduate overwhelmingly were kids who had major attendance problems at least as far back as 9th grade. Most of the kids who had the most trouble passing the required Regents Exams came from the same group of kids, or they were kids who came to school but did not do homework or pay attention—so they were physically present but were not engaged in learning. 

Why are we shifting away from standardized tests? I understand the argument that they do not do a good enough job measuring student proficiency with the state learning standards, but I did not see kids who really had those proficiencies and yet could not pass the exams. The bar on the required exams simply was not that high. Yes, there were occasional years when the Physics Regents Exam was outrageously difficult, but that was not a required exam for graduation. Yes, we need better exams, but what do we accomplish by moving away from them?

I am just troubled that we are removing the best information for voters, taxpayers and community members about the academic performance of the schools, perhaps largely due to wanting to stick our heads in the sand. We don’t want to pay attention to the degree of disparities across communities—disparities that have been exacerbated in the last few years. Senator Ted Kennedy agreed to an increase in standardized testing in order to shine more light on those disparities, and I find that reasoning compelling. I don’t see how dumping tests will help the students who need the most help or who (are supposed to) attend our lowest performing schools.

What is wrong with the Haladyna Item Writing Rules?

Haladyna et al.’s item writing rules project might have begun as merely a literature review, but by the time any of it was first published (Haladyna & Downing, 1989), it was presented “as a complete and authoritative set of guidelines for writing multiple-choice items.” The 2002 update (Haladyna, Downing & Rodriguez)—the last that claims to be grounded in the work of others—says “for classroom assessment” in the title, but then says, “This taxonomy may also have usefulness for developing test items for large-scale assessments,” in the article’s abstract. Their project has always been quite ambitious, intended to be far more than merely a literature review.

I have no problem with this ambition. The first fundamental problem with the Haladyna Rules is the arrogance. They do not explain the reasoning for each rule, and sometimes even go against the literature. Ambition is offering such expansive guidance; arrogance is doing so when you have never actually done the work professionally yourself. Ambition is trying to cover both classroom assessment and large scale standardized assessment; arrogance is mistaking your own opinion and/or preferences for reasoning and/or evidence.

The second fundamental problem with the Haladyna Rules project is that it displays profound ignorance about how items work and what their purpose is. Yes, the two articles (1989, 2002) are rather oriented towards reporting on what others have written, but their two major books (2004, 2013) are not. In either case, they could have explained their rules in terms of item functionality and purpose, but they do not. Are these rules focused on items’ ability to provide the intended information? Or, as we say so very often, valid items elicit evidence of the targeted cognition for the range of typical test takers. Their rules are focused on clarity and even presentation. They feel incredibly shallow, as though little consideration is given to content, and only a hand wave to test takers.

The third problem with the Haladyna Rules is their organization. Content, Formatting, Style, Stem, and Choices is simply not a good structure, and they do not even follow it consistently. Some ideas are listed separately in the Stem and Choices sections, and some are listed elsewhere to apply to both parts of an item. The cluing rule uniquely has subparts, but oddly does not include all of the cluing rules/guidelines. This looks like they imposed their own latent thinking, but not mindfully enough to do it well. It’s a mess. Perhaps they should have a section on Cluing and a section on Clarity, as these themes are common in their list.

The fourth problem with the Haladyna Rules is that the list is just sloppy. Despite all the iterations, there are a number of pairs of redundant rules: 13 & 16, 11 & 12, 17 & 27, 29 & 30, 14 & 15, and 23 & 24. Each of these pairs clearly should be consolidated into a single rule, and it is inexplicable to me that they were not. Moreover, the articles offer explanations for just some of their rules, not for all of them. Yes, by 2013 it is less sloppy, but most of the redundancies remain.

From my perspective—from the perspective of me and my frequent co-author—the deepest problem is the lack of understanding of what makes for a high quality item. Items must align with their assessment targets; they should measure what they are supposed to measure, and not something else in the neighborhood. That idea of alignment seems absent from these rules. And for items to be aligned, they must be aligned for the range of test takers, not just the ones like oneself or one’s own students. This is the idea of formal/technical fairness. But their project has only a small hint of that concern, and it is not explicit in any of the actual rules. This is what we refer to as content and cognition.

So, this approach to “Item-Writing Guidelines/Rules/Suggestions/Advice,” as they termed it, was destined to fail. Rather than helping teachers, item writers and/or content development professionals with the hardest parts of item development, they focus on the inane (e.g., “vertically instead of horizontally”), the obvious (e.g., correct spelling), the tautological (e.g., “avoid over specific and over general content”), the false (e.g., “the length of choices about equal”), and the shallow (i.e., just about everything else). It does not acknowledge the importance of item stimuli (e.g., reading passages), or even their existence! Even when there is a decent kernel of an idea, they fall short—usually woefully short.

Could a good list of item writing principles have come out of this project? I don’t know. I really don’t. I do not understand turning to people who lack deep experience in item development and lack a deep understanding of how items function, but it appears that that is the basis for their project. Of course, I understand that that is how scholarship and literature reviews work, but they literally claim that this resulted in a “complete and authoritative” list. Furthermore, their later works do not present this stuff as merely a literature review; these are their guidelines for writing items.

My colleagues and I have seen how these rules have become embedded in the field, in item writing manuals, in style guides, in educational assessment works of scholarship and of instruction. And they are bad. You can review my daily posts last month to see me take each of them down—all but Rule 14—but that’s not entirely necessary. The point is that these rules might help to develop shiny, polished items that look good, but they do not really help to develop highly refined items that elicit evidence of the targeted cognition for the range of typical test takers. And it appears that their list’s mere existence might have hampered the development of better item writing guidance—what else could be responsible?

Now, the field of educational measurement’s focus on the use of item difficulty and item discrimination statistics to evaluate items and to evaluate item writing rules has not helped. This approach undermines domain models and both content- and construct-validity. It renders claims of evidence from test content less than credible, because it bends what are supposed to be criterion-referenced tests into more effective norm-referenced tests. But that is no excuse for this mess. Haladyna et al. themselves are fully on board with this approach to evaluating items, and approvingly refer to such studies of item writing rules in their own work. But none of that explains why the field has not dived more deeply into what makes for an actually useful item.
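For readers who have not worked with those statistics: classical item difficulty is just the proportion of test takers who answer the item correctly (the p-value), and classical discrimination is usually a point-biserial correlation between the 0/1 item score and the total test score. The sketch below, with invented response data, shows the computation only—it is not an endorsement of judging items this way, and operational analyses typically also remove the item in question from the total score before correlating.

```python
import statistics  # statistics.correlation requires Python 3.10+

def p_value(item_scores):
    """Classical item difficulty: proportion of correct (1) responses."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Classical item discrimination: Pearson correlation of a 0/1 item score with the total score."""
    return statistics.correlation(item_scores, total_scores)

# Invented data: eight test takers' scores on one item and on the whole test.
item_scores = [1, 0, 1, 1, 0, 1, 0, 1]
total_scores = [42, 18, 39, 35, 22, 40, 25, 31]

print(f"difficulty (p-value): {p_value(item_scores):.2f}")
print(f"discrimination (point-biserial): {point_biserial(item_scores, total_scores):.2f}")
```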

Instead, we have the Haladyna Rules—which definitely need to be replaced.

 

Fisking the Haladyna Rules: The Complete List

Fisking the Haladyna Rules #31: Use humor sparingly

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Use humor if it is compatible with the teacher and the learning environment.

And now, the last Haladyna et al. rule. Or guideline. Whatever you want to call it. All their library research to compile the consensus of textbooks, researchers and other authors ends with this one. Is it a good rule?

Ha! This is a bad rule i) by their own official standards, ii) because they seem to contradict themselves and iii) because their explanation shows how little they understand the challenges of item development.

First, their 2002 article shows that they do not have even a single source that says that humor is a problem in items. Not one! Only 15% of their sources even mention the idea, and none favor their rule. There is no consensus here, and among the small number of sources who address it, they kinda say, “Meh, whatever.”

Second, their fuller statement of their rule contradicts their shorter statement. That is, their Table 1 says, “Use humor if it is compatible with the teacher and the learning environment,” but their Table 2 says “Use humor sparingly.” Do these two statements mean the same thing? Should it be used sparingly, or is it ok? The longer version seems to be generally supportive, and the shorter version seems generally unsupportive. How is this a rule or guideline or advice? Do they mean that it is ok for classroom assessment but not  for large scale standardized assessment? I can imagine that advice, but that is not what they offer. I can’t tell what they are offering.

Third, the worst thing about this rule is that it is teacher-centric rather than test taker- or student-centric. Test takers vary, and in an enormous number of ways. The problem with humor is that not everyone agrees on what is funny. People have different senses of humor, and attempts at humor in stressful or important situations are only a good idea if everyone gets the joke. But the less homogenous the group, the less likely everyone is to agree that something was actually funny. Haladyna et al. seem to understand neither this basic fact about humor nor the basic fact of test taker variation—which is essentially the issue of Fairness.

To try to explain this rule, they offer an example. A bad example. It is supposed to be funny, but instead it’s just confusing. And it is double keyed. So, the problem is not the use of humor; the problem is that it is double keyed! In fact, I seriously question whether the example is actually humorous.

In Phoenix, Arizona, you cannot take a picture of a man with a wooden leg. Why not?

A.        Because you have to use a camera to take a picture

B.        A wooden leg does not take pictures

C.        That’s Phoenix for you

There are so-oooo many problems with this item, including by their own rules. Not only is it double keyed, but option C does not even pretend to answer the question. The answer options are not parallel in grammatical structure, and they vary enormously in length. It’s not at all clear what the targeted cognition even is. There’s a negative in the stem that is not highlighted in any way. Heck, this is an item that would actually benefit from adding “D. All of the Above.”

Is the problem humor? Is the problem that it attempts humor? Of course not! It is a bad item for so many other reasons that have nothing to do with humor.

Is this their best argument? Yeah, I suppose it is. After all, between even just these four versions of their list of rules (i.e., 1989, 2002, 2004 and 2013), this example is from the final one. They always cautioned against the use of humor, and this was their refined reasoning.

It is the least supported, least defensible and worst rule in their list. Maybe not the dumbest—if only because writing funny is just hard!—but still the worst. Sure, there’s a good reason to avoid trying to use humor, but they do not even wave in that direction.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #30: Use common errors of students

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Use typical errors of students to write your distractors.

In 1989, 2002 and 2004, this rule and Rule 29 (Make all distractors plausible) were distinct.  

Their 2013 book finally combines these two rules. It puts them together into one heading, preserving the wording of each and simply separating them with a semi-colon. They finally get it right. Their explanation contains something really good. Something really really good.

The most effective way to develop plausible distractors is to either obtain or know what typical learners will be thinking when the stem of the item is presented to them. We refer to this concept as a common error. Knowing common errors can come from a good understanding of teaching and learning for a specific grade level [or other listed methods].

Finally. A decade after the 2002 article and more than two decades after the 1989 article, near the end of the list, after the multi-part cluing rule, they get to the real meat. Unfortunately, all of that just shows that they have no clue how important this rule is. This is what matters most. Distractors are the defining feature of multiple choice and other selected response item types, and this finally gets near the core of what makes for a high quality distractor.

For multiple choice items to elicit high quality evidence, they must be able to offer credible affirmative evidence (i.e., that the test taker does have proficiency with the targeted cognition) and also be able to offer credible negative evidence (i.e., that the test taker lacks proficiency with the targeted cognition). These two sorts of evidence are built in two different ways.

Affirmative evidence comes from items that require a cognitive path that depends on the targeted cognition to produce a successful response. All that cluing stuff Haladyna et al. keep coming back to? That is about alternative paths to a correct response via test taking savvy instead of through use of the targeted cognition.

Negative evidence is harder to collect. Negative evidence, as Rule 30 implies and their 2013 book says more clearly, requires offering potential responses that test takers might actually work to if they misunderstand or misapply the targeted cognition—that is, legitimate results of authentic mistakes. Any other distractor is a waste of everyone’s time. Only a guesser would select it, and that does not tell anyone anything—other than, perhaps, that the test taker didn’t even try to actually work through the item. If a mistaken test taker cannot find their own result among the distractors (i.e., because distractors are wasted on some other basis), they are clued to try again. Rather than gathering evidence of their mistake (or mistaken understanding), they are given a second chance. But other sorts of mistakes might not be given such a clue or chance, as when they do have corresponding distractors. That is why it is important that distractors always and only be based on common test taker mistakes.

A substantive and valid meaning for “effective” distractors would be distractors that actually gather negative evidence of proficiency by giving the most common mistakes with the targeted cognition their own corresponding answer options. Now, if there is really only one mistake that test takers make with this particular piece of knowledge or skill, then no one should expect more than one effective distractor. But if there is a common mistake that test takers make with a problem and it is not a mistake with the targeted cognition, then it is not a good or effective distractor! Such a distractor would suggest that test takers lack proficiency with the targeted cognition when, in fact, what they lack is other knowledge and/or skills.

Yes, test takers who make other mistakes should be clued to correct those mistakes, because items should not collect information about other cognition. That is, the common mistakes that are relevant are only the ones in understanding or applying the targeted cognition, even if they are not the most common mistakes, overall.

Yeah, this is about item focus. Should an item purported to be aligned with some specific targeted cognition confuse information about other cognition with information about the targeted cognition? Of course not! Other sorts of mistakes should not prevent test takers from being successful, and other skills (e.g., test taking savvy) should not be enough to enable success.

Substantively, effective distractors capture evidence of the lack of targeted proficiency. Anything else is ineffective, regardless of how often test takers select it. And ineffective distractors undermine item quality and every validity claim about a test.
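To make that concrete with a deliberately simple, hypothetical example of my own (not one from their books): for an item targeting two-digit by one-digit multiplication, each distractor can be generated as the literal result of a named, authentic mistake with that skill, so that selecting it is interpretable as negative evidence.

```python
# Hypothetical item: 24 x 7, targeting multi-digit multiplication.
a, b = 24, 7
key = a * b  # 168

# Each distractor is the result of a named, common mistake with the targeted skill.
distractors = {
    "added instead of multiplied": a + b,                                             # 31
    "dropped the carry from the ones place": (a % 10) * b % 10 + (a // 10) * b * 10,  # 148
    "multiplied only the tens digit": (a // 10) * b * 10,                             # 140
}

print("key:", key)
for mistake, value in distractors.items():
    print(f"distractor {value}: selecting it suggests '{mistake}'")
```

A wrong answer with no such story behind it would tell us nothing, except perhaps that the test taker guessed.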

This is the hardest thing about writing high quality items. Developing items that lack alternative paths is hard, and made harder by inappropriate test prep that stresses shortcuts for the particular items on a test over authentic use of the targeted cognition. Developing a full set of distractors is even harder. As Haladyna et al. finally explained in 2013, it really benefits from knowing about teaching and learning of the targeted content. It is made harder because teachers and other educators are always trying to improve teaching and learning, meaning that the most common mistakes or misunderstandings can shift over time as educators address the most common ones they see.

The lure of substantively ineffective distractors that nonetheless masquerade as quantitatively effective distractors (i.e., by popularity) comes in the form of distractors based on other kinds of mistakes, rather than mistakes with the targeted cognition. These can be used to raise or reduce observed empirical item difficulty, and often will not be caught by item discrimination statistics. Haladyna et al. do not understand this, which is why even though Rules 29 and 30 start to get into the meat of what a good item is, even they fall short.

Thus, taken together, Rules 29/30 are perhaps the most important rule(s), and yet, as Haladyna et al. present them, they still are not good.

 


Fisking the Haladyna Rules #29: Make distractors plausible

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Make all distractors plausible.

This might be the most important principle in all of multiple choice (MC) item development, and that makes it incredibly important to all of large scale standardized assessment because of the dominance of MC items on standardized tests. But Haladyna et al. fail to explain what makes a distractor plausible in their 2002 article. Note that there is a different rule about basing distractors on common test taker mistakes (i.e., Rule 30), so it cannot be that.

Their 2004 book provides a brief explanation, but still separates test taker errors from plausibility. They write that distractors should “look like a right answer to those who lack this knowledge” (p. 120). My regular co-author and I call that shallow plausibility. That is, those who lack the desired proficiency cannot easily and quickly dismiss it as incorrect. This idea of shallow plausibility undermines (or subsumes) most of Haladyna et al.’s advice on cluing (be it part of Rule 28 or any other rule) because it entirely shifts the issue into a different frame. Like their 2002 article, their 2004 book appears to equate “plausible” with “effective” and suggests that these are evaluated by judging how many test takers select them.
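For concreteness, the kind of evaluation they seem to have in mind looks something like the following sketch, with invented responses: tally how often each option is chosen and read “effectiveness” straight off the selection rates.

```python
from collections import Counter

# Invented responses to a four-option item; "C" is the key.
responses = ["C", "A", "C", "B", "C", "C", "A", "D", "C", "B", "C", "C"]
key = "C"

selection_rates = Counter(responses)
n = len(responses)
for option in "ABCD":
    share = selection_rates.get(option, 0) / n
    label = "key" if option == key else "distractor"
    print(f"{option} ({label}): selected by {share:.0%} of test takers")

# Under this popularity reading, a rarely selected distractor gets judged
# "ineffective" regardless of whether it reflects any authentic mistake
# with the targeted cognition.
```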

But is that a decent standard for judging items and effectiveness? If you care about validity—about content validity, construct validity, or validity evidence from test content—then it clearly is not.

Items aligned to easier assessment targets should be easier. Fewer test takers should select distractors for those items. Of course, that just begs the question of what “easier” means. Well, for assessment purposes, easier is not an intrinsic quality of the targeted cognition. Rather, it is about an interaction between the content, teaching and learning of the content and the item—all in test takers’ heads. When instruction improves (e.g., through better curriculum, better lesson plans or better pedagogy), measured content difficulty should drop. If some school, district or state does a better job of teaching some standard, the distractors don’t get less effective. Rather, more test takers are able to produce a successful response. Better teaching does not make items or distractors less effective simply because fewer test takers select an incorrect option.

This is simply a dumb way to think about distractor effectiveness. Truly dumb. The question is not whether distractors are selected by many test takers, but rather whether these distractors (as opposed to other potential distractors) are the ones that will be fairly selected by the most test takers. But to understand what that means, you’ll have to read tomorrow’s post.

But, frankly, this idea that distractors should be judged in a sort of popularity contest is what leads to the kinds of deception and minutiae that Haladyna et al. try to warn against in Rule 7 (Avoid trick items). If the best you can do when writing distractors is to try to deceive test takers, you are not trying to measure the targeted cognition at all. Dumb Rule 7 only exists because of this idea that items should be difficult, rather than that they should be fair.


Fisking the Haladyna Rules #28: Avoid clues

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Avoid giving clues to the right answer, such as

a. Specific determiners including always, never, completely, and absolutely.

b. Clang associations, choices identical to or resembling words in the stem.

c. Grammatical inconsistencies that cue the test-taker to the correct choice.

d. Conspicuous correct choice.

e. Pairs or triplets of options that clue the test-taker to the correct choice.

f. Blatantly absurd, ridiculous options.

This rule is huge. As is, it has six parts. In their 2004 book, this rule has five parts, and in their 2013 book, it has four parts from this list of six and two parts from elsewhere. So, that prompts the question of how I should go about responding to this rule. Some previous rules seem like they should just be sub-parts of this one. But this is a fisking project, so I will address everything.

Item developers should not fall into patterns that give away the correct answers. That’s the real principle behind their first sub-part, not just particular words. Always, never, completely, and absolutely should only be used in patterns that do not flag that option as correct or incorrect—just like all of the above and none of the above and countless other potential tells. This is not merely about their use in any particular item, but rather patterns in their use across item banks. It certainly should not lead to a prohibition that has nothing to do with how well these terms address content.
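The bank-level version of that principle can at least be spot-checked mechanically. Here is a minimal sketch, with an invented three-item bank, that counts how often flagged words land in keys versus distractors; a lopsided tally is the pattern-level tell to worry about, not any single use.

```python
from collections import Counter

FLAG_WORDS = ("always", "never", "completely", "absolutely",
              "all of the above", "none of the above")

# Invented bank: each item maps option text to whether that option is the key.
item_bank = [
    {"The sun always rises in the east.": True,
     "The sun never sets in winter.": False},
    {"Mitosis never occurs in plant cells.": False,
     "Mitosis produces two genetically identical cells.": True},
    {"Acids always taste sweet.": False,
     "Acids turn blue litmus paper red.": True},
]

tally = Counter()
for item in item_bank:
    for option_text, is_key in item.items():
        if any(word in option_text.lower() for word in FLAG_WORDS):
            tally["flagged words in keys" if is_key else "flagged words in distractors"] += 1

print(dict(tally))  # a lopsided split is the pattern to investigate
```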

I have never understood Haladyna et al.’s advice on “clang association.” They seem to be saying that items should not repeat key words from the stem either in the correct answer option or in incorrect answer options. That does not make sense to me. Why not? They offer that this can simply be too big a clue to the correct answer option—which seems just to be part D of this rule—or it can be a sign of a “trick” item. But I already addressed their dumb Rule 7 about trick items. I do not believe in trick items. Moreover, if some word in a title or quote actually is often misunderstood or misleading, then that sounds like it is a good basis for a distractor. It should not be avoided.

Isn’t grammatical inconsistency just a repetition of Rule 23? See my response to that rule, from earlier in the month. This sub-rule is folded into option homogeneity in their 2013 book. Length is also included in their 2013 book, but as a separate subpart.

Conspicuous correct choice? I’m not really sure what that means. That sounds more like an issue with a lack of plausible distractors, which might explain why this sub-part is missing from their books.

Pairs and triplets? Just another example of not understanding homogeneity, Rule 23. I already addressed that.

Blatantly ridiculous options? Yeah, that’s again about plausible distractors. That is its own issue, and perhaps the most important single principle of multiple choice item construction, right up there with clarity. It is not just about cluing, nor is it appropriate to bury it in some sub-part of a rule on cluing. So, this one gets its due attention in tomorrow’s post.

Where does that leave us? My colleagues and I worry about false-positive results. We worry about those alternative paths, including encouraging guessing with various tells. But this is not a good list of tells to worry about. It is not even Haladyna et al.’s complete list of tells, so what is this rule doing?

What is this rule doing? They report that 96% of the sources for their 2002 article support it, but is that any surprise? The rule has six parts, and a source needed to support only one of them to be counted! Why aren’t these listed as six different rules so we can see how many sources mention each sub-part, how many support it and how many ignore it? Are we to believe that each of the 96% mentioned all six parts? Certainly not! So, what is going on here?

No, this is not a great rule. It misses the point, confusing symptoms and outcomes for the actual principle at stake. As with so many other rules, this one makes clear that this list is not about deep principles.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., & Downing, S. M. (1989). Taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

]

Fisking the Haladyna Rules #27: Avoid NOT in choices

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Phrase choices positively; avoid negatives such as NOT.

This is yet another redundant rule. Rule 17 (Word the stem positively, avoid negatives such as NOT or EXCEPT. If negative words are used, use the word cautiously and always ensure that the word appears capitalized and boldface.) covers the same ground, and though their 2002 article offers some explanation for Rule 17, it offers nothing on this rule.

So, I feel the same way about this rule as I do about that one. That is, when using the word not or some other negating word, be sure to set it in bold and underlined type. I do not understand why they offer an out for stems (i.e., bold and all caps) but not for answer options. Their 2004 book also offers that advice, and it rewrites Rule 17 so that the formatting guidance is no longer embedded in the rule itself. That seems superior to this 2002 version.

Thus, in their 2004 book, it is clear that these really are more guidelines or advice than they are rules. Their 1989 articles call them “rules” dozens of times, including in the title, and say that it is “a complete and authoritative set of guidelines for writing multiple-choice items” (p. 37). The 2002 article does not call them rules, leaning instead on the word “guidelines.” It ends with wisdom from the 1951 edition of Educational Measurement, the handbook edited by Lindquist. They quote Ebel, rather than Lindquist’s own brilliant chapter.

Each item as it is being written presents new problems and new opportunities. Just as there can be no set formulas for producing a good story or a good painting, so there can be no set of rules that will guarantee the production of good test items. Principles can be established and suggestions offered, but it is the item writer’s judgment in the application (and occasional disregard) of these principles and suggestions that determines whether good items or mediocre ones will be produced. (p. 185)

Yes, this is a great quote. Yes, there is actual wisdom in there. But I do not buy for a second that Haladyna et al. believe this. Rather, it feels like too little, too late. They are 20 (of 22) pages in before they use the word principle, and the article offers very little to help item developers build that critical professional judgment. Guidelines without deep and thoughtful explanations have no chance of being understood as true principles. Their approach to presenting these ideas invites them to be read as rules. Including something like Rule 6 (Avoid opinion-based items), claiming that it is supported unanimously (though it has just 26% support from their sources), and failing to offer any explanation for it is clearly not an effort to support professional judgment in the application of worthy principles. Offering all those numbers in their Table 2 (p. 314) without real explanation leans on the misleading precision of numbers to bolster the seriousness and credibility of everything on their list. Their 2002 article claims that 24 of these rules have “Unanimous Author Endorsements,” yet those 24 rules are, on average, merely mentioned by fewer than two-thirds of their sources. Why make such a claim if not to suggest that these are truly rules?

Yeah, I don’t like negatives in answer options, but I don’t think that I can defend a prohibition or even discouragement. Clarity and simplicity of language are good goals, and—as I wrote above—put negative or negating words in bold and underlined type to make sure that test takers don’t miss them. This was all addressed in Rule 17. So, there’s no new principle here in Rule 27, not from me and not from them.


Fisking the Haladyna Rules #26: Avoid All of the above

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Avoid All-of-the-above.

This rule and Rule 25 (None-of-the-above should be used carefully) are the two most opposed rules (by their own sources) on the Haladyna lists, though the explicit opposition to this rule is half that of Rule 25. To be fair, 70% of their 2002 sources support this rule, though Haladyna et al.’s offered reasoning seems a bit weak.

Their all of the above analysis cites evidence that this answer option makes items less difficult, but their analysis of Rule 25 (none of the above) expresses concern that that phrase makes items more difficult. Were they simply reporting on the literature, these differing results would just be different results for different phrases. But as they are offering their views as actual recommendations, guidelines, or rules, it is not even clear why a phrase’s impact on item difficulty automatically makes it objectionable.

The basis for this rule seems to be that when all of the above is included as an answer option, it is far, far too likely to be the correct answer option (i.e., the key). That is not a reason to avoid it, but rather a reason to use it as a distractor more often. Test takers, teachers, and test preparation tutors would quickly learn that it is no longer a dead giveaway, wisdom that I heard decades ago.

Their 2004 book suggests two ways to avoid all of the above. First, “ensure that there is one and only one correct answer options.” Yeah, duh. That might limit the nature of the content that could be included, so I don’t favor that. Their other suggestion is to turn the simple multiple choice item into a multiple true-false (MTF) item. That is a much, much better idea. Provided that the testing platform allows for MTF items, they should probably be used more often. Yes, they can take more time than simple multiple choice items, but they can delve deeper into various facets of an idea. Anything that helps selected response tests to assess more deeply is a very good thing.
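
To make the MTF suggestion concrete, here is a minimal sketch of what such a conversion might look like, assuming an invented item structure in Python. Nothing here comes from Haladyna et al.’s books or any particular testing platform; it only illustrates the idea that each substantive option becomes its own true/false judgment.

```python
from dataclasses import dataclass

# Hypothetical sketch of converting a one-key multiple choice item that
# leans on "all of the above" into a multiple true-false (MTF) item.
# The classes and field names are invented for illustration.

@dataclass
class MCItem:
    stem: str
    options: dict        # label -> option text
    key: str              # label of the single correct option

@dataclass
class MTFItem:
    stem: str
    statements: list      # list of (statement, is_true) pairs

def to_mtf(mc: MCItem) -> MTFItem:
    """Treat every substantive option as its own true/false judgment.
    When 'all of the above' was the key, every remaining option becomes true."""
    aota_keyed = "all of the above" in mc.options[mc.key].lower()
    statements = []
    for label, text in mc.options.items():
        if "all of the above" in text.lower():
            continue  # the catch-all option disappears in MTF form
        is_true = aota_keyed or (label == mc.key)
        statements.append((text, is_true))
    return MTFItem(stem=mc.stem, statements=statements)

mc = MCItem(
    stem="Which of the following contributed to the result?",
    options={"A": "Factor one", "B": "Factor two",
             "C": "Factor three", "D": "All of the above"},
    key="D",
)
print(to_mtf(mc))
```

The design point is that the MTF form asks for a judgment about every facet, so the catch-all option is no longer needed at all.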

So, what do I think of this rule? I think greater use of MTF items would be a positive change. Otherwise, I would rather all of the above be used far more often as a distractor than it be abandoned for use as a key.


Fisking the Haladyna Rules #25: Use None of the above carefully

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: None-of-the-above should be used carefully.

First, I cannot tell you how much I hate this rule, nor how much it betrays the deep disrespect and disdain that so many have for the work of content development in large scale assessment. No one would suggest that point-biserial or IRT should be used carefully, because everyone assumes that psychometrics is always done carefully. What does this rule or guideline even mean? Don’t be sloppy? Item developers should never be sloppy, whether they are using “None of the above” or not. They are professionals, and no professional should be sloppy in their work.

Second, their 2002 sources and the empirical research are simply split on this. There is no consensus.

Third, the 2002 article has an explanation that might be the most nuanced portion of the whole piece.

Given recent results and these arguments, NOTA [none of the above] should remain an option in the item-writer’s toolbox, as long as its use is appropriately considered. However, given the complexity of its effects, NOTA should generally be avoided by novice item writers.

Frankly, this kind of analysis should be applied to virtually their entire list, but it is nice to see it at least once. Of course, “generally be avoided” is not actually actionable advice. It means that they can use it, but…I guess they should be careful, just like everyone else. Yeah, item development is hard.

Their none of the above analysis cites evidence that it makes items more difficult, but their analysis of Rule 26 (all of the above) expresses concern that that phrase makes items less difficult. Were they simply reporting on the literature, these differing results would just be different results for different phrases. But as they are offering their views as actual recommendations, guidelines, or rules, it is not even clear why a phrase’s impact on item difficulty automatically makes it objectionable. In fact, a plurality (48%) of their 2002 sources are against the use of none of the above, and only slightly fewer (44%) are fine with it. There is no consensus.

Last, their 2004 book says, “When none of the above is used, it should be the right answer an appropriate number of times.” No, I do not have any idea what that is supposed to mean. My frequent co-author suggests that they mean something like “should only be the key approximately 25% of the time (for 4-option items) or approximately 33% of the time (for 3-option items).” But they’ve never shown that kind of thinking about how to read, understand, or analyze items, so I don’t think she’s right. Of course, her explanation has the benefit of giving some meaning to this rule, which otherwise lacks any.
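
For what it is worth, her reading is at least checkable against an item bank. Below is a minimal sketch, again in Python with an assumed item format and an arbitrary tolerance, of the kind of key-balance check her interpretation would imply. It is my illustration of her reading, not anything the 2004 book actually specifies.

```python
# Hypothetical key-balance check: across items that offer "none of the above,"
# is it keyed at roughly the chance rate for the number of options?
# The item format and the tolerance are assumptions for illustration.

def nota_key_rate(item_bank, tolerance=0.10):
    nota_items = []
    for item in item_bank:
        for label, text in item["options"].items():
            if "none of the above" in text.lower():
                nota_items.append((item, label))
                break
    if not nota_items:
        return None
    keyed = sum(1 for item, label in nota_items if item["key"] == label)
    observed = keyed / len(nota_items)
    # Expected rate if the key were spread evenly across options,
    # e.g. ~0.25 for 4-option items, ~0.33 for 3-option items.
    expected = sum(1 / len(item["options"]) for item, _ in nota_items) / len(nota_items)
    return {
        "items_with_nota": len(nota_items),
        "observed_key_rate": round(observed, 2),
        "expected_key_rate": round(expected, 2),
        "within_tolerance": abs(observed - expected) <= tolerance,
    }

# Example with a tiny bank: one item keys NOTA, one does not.
bank = [
    {"options": {"A": "12", "B": "15", "C": "18",
                 "D": "None of the above"}, "key": "D"},
    {"options": {"A": "7", "B": "9", "C": "None of the above"}, "key": "A"},
]
print(nota_key_rate(bank))
```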
