Fisking the Haladyna Rules #5: Avoid over specific/general

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Avoid over specific and over general content when writing MC items.

Their fifth rule has two fundamental problems. First, it is supported by just 15% of their 2002 sources and not even that many of their 1989 sources. Furthermore, fully 25% of their 1989 sources argue against this rule. But that general lack of support is not the biggest problem.

The biggest problem is that this rule does not mean anything. All they are saying is: do not do something too much. Sure, some people say you can never be too rich or too thin, but other than that sort of formulation, saying too (i.e., as in too much) or over (i.e., as in too much) makes the rule a tautology. That is, it is a logical circle. Yes, it would be bad for the item to be too hard. Don’t do that. It would be bad for the item to be too specific. Don’t do that. It would be bad for the font to be too small, too big, too baroque. Again, that’s just what “too” or “over” mean!

So, the real question is what it means to be over specific or over general. They seem to be saying that such things exist, but provide no guidance whatsoever for what they mean. They are providing an objection, but no basis for when it applies—or even what it really means. Nothing. Just nothing.

I do not think that I hate any of Haladyna et al.’s rules more than this one. It epitomizes the problem with their whole approach. Perhaps there is value in their original articles as literature review. Perhaps. But they have brought these lists forward into handbooks and their own books. Others cite them and quote them all the time. They called their list, “a complete and authoritative set of guidelines for writing multiple-choice items.” Is it really? Does this example from their “Guidelines/Rules/Suggestions/Advice” (as they called them in 1989) actually help anyone to write or evaluate items?

Obviously not. I do not think it could be any more obvious.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #4: Keep items independent

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Keep the content of each item independent from content of other items on the test.

There is no explanation of this rule anywhere in Haladyna and his colleagues’ 1989 article or their 2002 article. Moreover, less than half of their sources for either article mention this rule. Their companion 1989 article does not list a single empirical study on this rule. And yet, they still call this a consensus.

One must turn to their 2004 or 2013 books to find an explanation. No, the explanation is not convincing. They offer a pair of items as a counter-example to demonstrate what is wrong. Note, this is THEIR example. They change the names, but I will use the more recent ones (2013).

Who was Kay’s best friend?

a. Wendy

b. Betty

c. *Tilda

Who was quarreling with Tilda?

a. Kay

b. Wendy

c. Betty

They claim this shows what is wrong because if the test taker knows that Tilda is the correct response to the first item then they will know that Kay cannot be the correct answer to the second item. Because…stories are never about good friends quarrelling? Not at all. Of course not. Where would the drama or story arc be in that?

Frankly, this rule makes it more difficult to ask anything but the most trivial questions about literary passages because the themes and characters run through them. No, those items are not independent in topic; after all, they are taken from the same story. This is almost as great a problem when using informational passages.

Now, there is an issue with the independence of items, one that I and one of my closest colleagues disagree on. As a science educator, she wants items that scaffold up to larger or deeper understanding. She thinks of item sets as single, larger items with various components, even when they technically are not. She wants later items to build upon the answers of earlier items—and even wants the structure of the item set to help test takers to do that. I really do appreciate what she is trying to do, and as a classroom teacher I might do the same thing. But we are both in large scale standardized assessment now. We are trying to find that optimal (or least bad) balance that allows test takers an opportunity at every point on an assessment, assesses test takers across the range of an NGSS standard (i.e., performance expectation), and yet does not provide so much scaffolding that we cannot be sure whether the test taker actually has the inferred proficiency.

How independent should items be of each other? Not nearly so much as Haladyna et al. claim. And their example is laughable.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #3: Use novel material

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Use novel material to test higher level learning. Paraphrase textbook language or language used during instruction when used in a test item to avoid testing for simple recall.

There is something in this rule that I like, but it really falls off. If there is a need for a second sentence, it certainly should not be that one.

Novelty in items is incredibly important, but it should not be limited to testing “higher level learning.” Honestly, what is higher level learning? What are they talking about? Later in the article they say problem solving and critical thinking, but that leaves us wondering why they didn’t just say that in the rule.

Also in the text, they write of this particular rule as an example of a rule without an empirical basis, but one that they advocate anyway. That might be fine, but check out their reasoning.

We might argue that many of these unresearched guidelines were common sense that did not justify research. For example, consider the guideline: “Use novel material to test higher level learning.” Most educators place a great value on teaching and testing higher level thinking. Thus, this item-writing guideline may be a consensus value among testing specialists without the need for research.

They offer that because educators agree on the goal, this is the right method. They do not even try to explain why or how novelty is so important for “higher level learning.” As an English teacher, I would read that lack of explanation as a lack of investment in thinking, which is so often the case with this list.

In fact, novelty is important for a wide range of cognition, and the issue is not solved simply by paraphrasing. If the example in an item is the same as one used in a teacher’s lecture or some class activity, the test taker might simply recall the answer given to them by their teacher—rather than generate it themselves as the test assumes they would. If a reading passage is taken from a work read for class, they might recall their teacher’s (or fellow students’) explanation or analysis, rather than generate their own. For example, if you want to test whether they know about the dynamics of the Romeo & Juliet balcony scene, by all means use that excerpt. But if you want to know whether they can read Shakespeare’s language and understand it, you need to present something that test takers have not already had explained to them. Note that this is not simply about critical thinking, as it could simply be about understanding the plain language.

In fact, items need to be sufficiently novel that test takers cannot simply rely on their memory of how that example was already explained to them, but not so novel that they require significant new learning to make sense of. That can be a careful balance, and it is made all the more difficult because our formal curriculum can vary so much from district to district, and even where the same formal curriculum exists, the enacted curriculum and lesson plans can vary enormously. It takes real knowledge of how content is generally taught to find the appropriate level of novelty.

Taking derivatives (i.e., differential calculus) likely counts as problem-solving in Haladyna et al.’s view. But the simple items should not ask about x-squared (x^2). That simply is not novel enough. Asking about x^8 or x^197 is no more difficult for someone who understands, and yet the answer is not going to be simply recalled. However, I think that such a task does not rise to the level of critical thinking or problem solving. It is clearly what Webb’s Depth of Knowledge (DOK) and our own revised Depth of Knowledge (rDOK) would classify at the lowest level of cognitive complexity. And yet, that same need for novelty is just as present.
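
To make that concrete with a worked example of my own (not one that Haladyna et al. offer), the power rule treats all of these exponents the same way:

$$\frac{d}{dx}x^{2}=2x \qquad\qquad \frac{d}{dx}x^{8}=8x^{7} \qquad\qquad \frac{d}{dx}x^{197}=197x^{196}$$

The larger exponents demand nothing extra from a student who knows the rule; they only prevent the answer from being pulled straight from memory.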

Yes, novelty is a very important idea in item development. But no, this rule does not get at it accurately. It is affirmatively damaging because it suggests that novelty is not important outside of “higher level learning” and that paraphrasing is a sufficient mitigation.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #2: Important, not trivial content

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Base each item on important content to learn; avoid trivial content.

Oh, we hate this one. We think that this rule undermines any concept of item or test alignment, even though 42 of their 54 sources apparently support it.

It simply is not for item developers to decide what to test. Standards or other forms of domain modelling lay out what should be tested. It is the job of item developers to figure out how to assess that content, not whether to assess that target.

Furthermore, this rule rather falls towards a tautology. Obviously, given limited testing time—as is invariably the case—the time should be spent well. Yeah, only test things worth testing. Perhaps this rule does not quite reach the level of tautology, but it does not get beyond being too obvious to need saying. As is, it is either too obvious or it begs the question. That is, what counts as trivial? What counts as important? Do they give any guidance on that?

We would prefer it to be phrased as, “Don’t waste test takers’ time.” But should that really need to be said?

Now, there is a related point that they are not making here, but it is very very important. That is, when creating an item aligned to some assessment target, aim for the core of that target. Aim for the most important part of the standard, the part that is most useful or most likely to be built upon later. Do not simply aim for the easiest part of the target. Yes, that would be easier. But it would not help test validity in any way, and would not help test takers or other users of tests.

But that is not what this rule is about.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #1: Single content and behavior

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Content: Every item should reflect specific content and a single specific mental behavior, as called for in test specifications (two-way grid, test blueprint).

We actually like this rule. This is a good start for the Haladyna list. No, this 2002 rule has no real antecedent in the 1989 list, but that is to the 2002 list’s credit. And this rule is supported by every one of their sources that mentions it, though those sources are not quite three quarters of the total.

The problem we have with this rule is that Haladyna and his colleagues never explain it anywhere. We have found that, in practice, the meaning of and reasoning behind this rule is often unknown. Frankly, we cannot be sure that Haladyna et al. even mean what we would mean by this rule, and that’s a real problem.

We believe that it is important that each item be aligned to one specific assessment target. That targeted cognition should come from a domain model. Quite often, this is a standard from a set of state learning standards, or it could be some element from a job or role analysis. We believe in domain modelling and domain analysis; we love ECD (i.e., evidence centered design). (We recognize that the good work done by ECD to highlight the importance of domain models came after 2002, so we forgive Haladyna et al. for thinking that assessment targets just come out of test specifications.) 

We know that it is important that items each target just one thing because otherwise there would be no way to determine why a test taker responded to the item unsuccessfully. They could just be making the same mistake over and over again, each time one standard is part of an item, even though they have mastery over all the rest. We should not be basing inferences about the successful learning, teaching, or coverage of a standard (i.e., when evaluating students, teachers, or curriculum, respectively) on such ambiguous sorts of evidence.

Just as importantly…well, actually more importantly, each item should actually depend appropriately on that targeted cognition. There should not be alternative paths to a successful response available to test takers. They should have to make use of the targeted cognition to get to a successful response, and that targeted cognition should be the key step (i.e., the thing they mess up, if they are going to mess anything up). Otherwise, items can yield false-positive or false-negative evidence.

Is all of that clear in how Haladyna et al. phrased their rule? Is it made clear elsewhere in their article? Is it made more clear elsewhere in all their writings? Not really.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Sometimes, You Lose

This week, my local school board voted on a final appeal by a very small group of people in my district to ban five books. The school board used this special working group meeting to act more like a book group than a policy and oversight board, which troubled me. But at the end of the discussion of each book, they voted not to ban it from the classroom or the libraries of the town.

(Toni Morrison will not be banned, here. Books about coming out as gay will not be banned, here. Sex education reference books will not be banned, here. Books that display honest and truthful depictions of the teen experience will not be banned, here.)

Yes, that is a victory. But it is a battle that should not have to be fought.

You see, we had a school board election a couple of years ago in which the candidates newly nominated by the recently radicalized/trumpified local Republican committee—which included none of the Republican incumbents—all lost, and lost badly. They lost even though, by local regulation, no single party can hold more than a bare majority of seats on the school board.

Which means that the Trump-Republican candidates got the GOP line on the ballot and the GOP incumbents had to run as independents. And the crazies still lost 2-to-1. I do not mean that they only got 1/3 of the school board; I mean that their candidates only got 1/3 of the vote. And I’d bet serious money that most of that was simply habitual supporters of GOP candidates, as opposed to voters who understood what they were running on.

These losers simply do not accept that they have lost. The community heard what they have to say about the values that they wish to rule by, and the community turned them down decisively. The community does not believe that educators are groomers or that our public schools are a threat to children or families.

I wish that this kind of refusal to accept that they have not convinced anyone were limited to this district. Unfortunately, we see the losers on a bigger stage than this. Today, the Supreme Court agreed to hear a case in which these same sorts of people want private businesses who do not want to be associated with their garbage to be forced to publish it, bearing whatever it costs them. Freedom of association and freedom of expression be damned!

As Twitter (now X) makes more room for this garbage on its own platform, advertisers are running away. The market speaks.

But these losers—in elections and in the market—simply refuse to listen. They want to force their ideology on others, abandoning any respect for democracy or markets.

It is bad enough that their ideas are odious, grounded in fantasies and other untruths, based on a need to hate and demean others, on theocracy, and quite often on naked racism. No, that’s not enough. They also refuse to accept when they have lost, or to adjust their goals—or even their tactics—to that reality.

I do not know how to live in a society with people like that. I just don’t.

Fisking the Haladyna Rules: Introduction

In 1989 and in 2002, Haladyna and his colleagues published compilations of item writing rules from the literature. These two lists, which I call The Haladyna Rules, have been incredibly dominant ever since. Virtually every serious source on test development refers to them as the item writing guidance. In 1989, Haladyna & Downing called their list, “a complete and authoritative set of guidelines for writing multiple-choice items” (p. 37) and then they updated and refined it further with Rodriguez in 2002.

The problem is that the list is just not very good. It focuses on presentation and perhaps test takers’ confusion, but really gives rather little guidance about how to make sure that the item actually tests what it is supposed to test. Now, we call this idea—that a test (or item) actually tests what it is supposed to test—“validity.” There are actually much more complicated ideas and principles for validity, but at the fundamental level, the question is whether the test is doing what it is supposed to do. And the Haladyna rules that so dominate the “Item-Writing Guidelines/Rules/Suggestions/Advice” (1989, p. 40) simply steer clear of this idea, for the most part. (Among those more technical concerns and views about validity is that it is actually how the test is used that matters. But whether a test even can be used appropriately depends deeply on whether it is actually testing what it is supposed to test.)

But that is not the worst thing about them. I will save the worst thing about them for the end of this series.

You see, there are 31 rules in the 2002 version. And there are 31 days in October. (Hello, lady!) So, each day of the month of October, I will analyze one of Haladyna, Downing and Rodriguez’s 31 rules. In order.

Yes, I am saying that rule 31 is the worst thing about the list.

Buckle up.

Giving Writing Advice

Both as a high school English teacher and as a dissertation coach, I have a lot of experience giving people advice on their writing. As a student, a collaborator and researcher (and occasional blogger), I have a decent amount of experience receiving advice on my writing. I have observed that there are two basic approaches. Both aim to help the writer produce a better piece, but they come from very different directions and have quite different goals.

(Side note: Unless someone specifically asks for it, proofreading is really not nearly as valuable as so many people seem to think. Proofreading a piece that is not yet otherwise in its final form can be a little bit useful, but the fact is that many words and sentences and even paragraphs are going to change, creating numerous opportunities for little mistakes to be (re-)introduced later, anyway. This kind of feedback is not advice, as it doesn’t actually help the writer to know whether the piece works and/or give them any ideas for how to improve it.)

The approach that I aim for tries to build on what is there in the piece, to help the author do a better job of meeting their goals, of communicating what they want to say. This approach requires reading the piece from the author’s perspective and asking questions about it to give advice that is centered on them and their hopes for the piece. Sure, this sort of advice might result in wholesale reworking of the piece, but only to make it more clear what they are trying to get at, and to improve the piece’s ability to convince others or to convey their idea to others. This approach is centered on them and their thinking, helping them to write the best version of their piece, one that they can be most proud of. This approach is all about supporting them. I learned to do this from my own best teachers—because they were focused on helping me grow as a thinker, rather than just tearing apart where I was at the time.

The alternative approach is centered on the concerns and interests of the advisor, instead of the writer. The advisor thinks about what they would most like to read, or other ideas that the draft reminds them of. They offer suggestions for making the piece more like something that they would write, perhaps stylistically or even in the ideas and/or perspective in the piece. This kind of advice is actually less about improving the piece than it is about making it feel more intuitive to the advisor.

Of course, there are a couple of circumstances in which the second approach is actually quite appropriate. If the advisor is actually a gatekeeper who gets to decide what is published on their platform, they may be concerned about the voice or content that their platform consciously tries to cultivate. They may make exceptions sometimes, but they know that they will not do anything like that in this case. Such advisors are in a position to put their own preferences ahead of the writer’s own concerns, as the writer likely wants the piece to be included there. In such cases, the writer can decide whether it is worth it to them to give in on what they wanted to say or on their voice in order to be included on that platform. Though that advisor is trying to exercise a lot of power, it remains up to the writer to accept that deal, or not.

The other circumstance is much much more rare. There are occasional advisors who are really good at having a feel for the target audience of readers—as opposed to just a sense of their own personal preferences. Such advisors would not actually be pushing their own voice and/or ideas, but rather channeling what that desired audience might respond to best. That might call for some serious shifting of the piece, even if it shifts somewhat from the author’s original intentions.

This dynamic is not limited to advice on a piece of writing. I first noticed this dynamic when I was a teacher and paid close attention to the advice that principals and assistant principals give teachers. I cannot tell you how many times I had directed at me or heard directed at others advice along the lines of, “Well, when I was teaching, I would…” What I never heard from them was, “This never worked for me, but I think something that would work for you is…” On the other hand, I did hear senior teachers say, “You should watch how [other teacher] handles that. I think that they do something that would work for you.”

As a teacher, I want to help my students to better express what they want to say. If it is not working or I do not think that it can work, I might engage in a conversation to find something related from their own perspective and values that they might want to say instead. But I always always try to find the kernel of what is really important to them to help them express that in a way that is true to themselves. Obviously, as a dissertation coach, the goal is a form of writing that does not feel natural or authentic to anyone, but that only makes it more important that the ideas and perspective are deeply grounded in their own experiences and views.

There is something deeply frustrating about advice to writers (or others) that marginalizes their views and goals. Sure, it might yield a better piece, though that is questionable. It is simply more likely to yield a piece that is more agreeable to the advisor, but that is not the same thing.

My Most Conservative Policy Opinion

I believe in democracy.

It is not that I believe that democracy always comes to the right answer. It is not that when democracy comes to an answer different than my own that I am convinced that I am wrong. But I was taught that politics is how groups of people make decisions about values, and in this country our political system is a democratic one. And you know Winston Churchill’s famous description, “Democracy is the worst form of Government except for all the other[s].”

What is the alternative? Divine right of kings? Oligarchy? I believe that the more people that vote, the more legitimate the decisions are. Surely, high quality education matters, because an electorate of well-informed voters will yield better decisions than one filled with ignorant and thoughtless ones. This is what so many state constitutions say; this is why we need strong universal public schools.

Of course I would rather everything be done according to my values and my wishes and my judgment. But if we are vesting all the power in a single person, it’s not often going to be me. So, I prefer the wisdom of the crowd to any other mechanism. (With some sorts of constitutional limits, of course.)

And yet, for my whole career, I have seen educators and others deny the validity of democracy when they don’t agree with The People. They want to override the will of The People and the will of their communities. They think they know better and should be able to apply their own moral judgment in place of our democratic system of government and laws.

Now, I understand the challenge of a situation in which one’s own values are quite at odds with the values embedded in the law. I understand the outrage of seeing marginalized and disempowered people (and groups of people) further harmed by the actions of governments—especially when those people are children. Especially when those people are children. I feel that deeply.

But what is the alternative? I do not want an impassioned minority to impose its mistaken will upon me. How can I tell the difference between those situations and my own desire to impose my own minority will on others? How do we determine a system of government on any level if every impassioned minority thinks that they have the right to impose their will upon others?

We have seen better paths. We have seen the civil rights movement engage in direct action and civil disobedience. We have seen a Gandhi-led colony throw off its colonial oppressor by exposing the moral excrescence at the center of that enterprise. We have seen attitudes change—and the law change—as more people learned more about same-sex love and the LGB people around them. We have seen paths.

I am deeply saddened—and at times infuriated—at what appear to me to be right wing minority views of what schools may do to support and educate children. I have seen irrational fear mongering and unrealistic views of what is actually developmentally appropriate for children at different ages, built upon willful ignorance born of arrogant self-centered intellectual laziness. I have been appalled and morally outraged. These asses exist in my community, as they exist around the country.

Luckily, I live in the part of the country where these asses have not gained control over the levers of government, where they are so obviously in the minority and cannot hide behind other culture war issues. I know that, therefore, it is easier for me to preach democracy. I have no children being targeted by them. I know that, therefore, it is easier for me to preach democracy. It is easier for me to have patience for the democratic mechanisms of discussion and debate to work their way through elections to policies.

But what is the alternative?

Manipulating Item Difficulty

There are these psychometric guidelines on desired item difficulty levels. Items should not be so hard that too few test takers respond successfully and not so easy that too few fail to respond successfully. (Yes, this view is based in empirical observations about item difficulty—that is, the share of test takers who respond successfully.)
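
To spell out what “empirical difficulty” means here (a standard definition, not anything particular to those guidelines), it is classical test theory’s p-value for an item:

$$p_i=\frac{\text{number of test takers who answer item } i \text{ correctly}}{\text{number of test takers who attempt item } i}$$

The guidelines in question want that value to stay away from the extremes of 0 and 1.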

That’s just garbage. That is, if a test is supposed to be criterion-based, that is total garbage.

I guess it’s a question of what tests are for and what they are supposed to measure. If the goal is to sort or rank test takers, there simply is not a lot of information in such items. Sure, there is some, but not as much.
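
The usual rationale, for what it is worth, comes from unidimensional item response theory (the same assumption questioned below). As a sketch using the standard two-parameter logistic model, an item’s statistical information is

$$P_i(\theta)=\frac{1}{1+e^{-a_i(\theta-b_i)}}, \qquad I_i(\theta)=a_i^{2}\,P_i(\theta)\bigl(1-P_i(\theta)\bigr),$$

which peaks where P_i(θ) = 0.5 and shrinks toward zero for items that nearly everyone answers correctly or incorrectly. That is the sense in which extreme items carry less information for ranking test takers on a single scale.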

But certification/licensure exams and most educational testing are criterion-referenced, not norm-referenced. If the goal is to measure what test takers know (and don’t know) and can do (and can’t do), then it should not matter how empirically difficult an item is. If it is a criterion-referenced test, then those items offer us just as much information as more moderately difficult items.

Of course, this comes down to unidimensionality. If the goal of the test is some unidimensional result, highly difficult or highly easy items might provide less information. But if the goal is to report on the many standards (or other elements in the domain model) in the construct, those easy and difficult items might provide large amounts of information about their targeted cognition.

Yeah, the idea that items should not be too easy or too difficult comes from, again, assumptions of unidimensionality.

Furthermore, there are other huge problems with this sort of thinking about item difficulty. It makes some sense to define item difficulty empirically, but that actually shifts attention from the deeper meaning. Some ideas and lessons are more difficult to teach and learn, and therefore get more attention from teachers and more practice from students. Other ideas and lessons are easier to teach and learn and therefore get less time and attention. That difference in attention to teaching and learning moderates the difference in difficulty when empirical definitions are used. So, what does difficulty actually mean, in terms of cognition?

Of course, it doesn’t stop there. The quality of instruction (as opposed to the quantity discussed above) also impacts learning and subsequent performance. The best teachers might even combine higher quality instruction with higher quantity instruction for some lessons—perhaps because of how difficult the content is and perhaps because of how important they think the content is.

I got just one question wrong on my Chemistry SAT II subject test back in the dark ages. It was about what substance is purple when in solution in water. I asked my chemistry teacher about it the next Monday and she knew the answer immediately (KMnO4?); she just did not think it important enough to spend time on in class. And 35 years later, I can’t really disagree with her. But it is not like this is a difficult idea or lesson. It’s trivial. It’s low cognitive complexity. It’s just memorized knowledge, and a small amount of memorization at that.

So, if this question had never appeared on an important test before, it would likely not be taught in many chemistry classes. But if it appeared on every important chemistry test, it would be taught everywhere. Would that change the difficulty of the concept? As it was taught more, should the item be manipulated to maintain its difficulty? For example, the word purple might be replaced with the word aubergine. Would it be a higher quality item when altered to increase empirical difficulty?

Items should elicit evidence of the targeted cognition for the range of typical test takers—ideally without Type I or Type II errors (i.e., false positives and false negatives). Their difficulty should be determined by the difficulty of the targeted cognition and the quality/quantity of the teaching and learning about the targeted cognition. If the goal is to report on learning, knowledge, abilities and/or skills, there well could be some items that everyone responds to successfully and some that no one responds to successfully. That should not be a problem.

And if the psychometric scoring models say otherwise, they should be replaced.

Barbie, Patriarchy and “Patriarchy”

Forgive me for posting some film commentary.

I really liked the Barbie movie. I think it is really smart, and somewhat subversive in how it makes its feminist arguments. (I will try to avoid spoilers.)

There are two main feminist arguments presented in Barbie. They are presented quite differently, and serve very different functions in the narrative.

The most important narrative driver of the Barbie movie is the idea that there is an essential emptiness in one gender living without power, without access to meaning or purpose, other than to be seen and valued by the other gender. Ken voices that he is in this position very early in the film. This existential state — and its untenability — is there from when the audience is first learning about Barbieland, and it is addressed both by and in the film almost to the very end. This is part of Betty Friedan’s Feminine Mystique. The Barbie movie explicitly addresses the untenability of this sort of role, but inverts gender roles in order to make the point. Nonetheless, this remains a powerful feminist thrust of the film, one of the most powerful feminist arguments of the 20th century, even though it is presented in the film as the state of men’s position. It is not overtly presented as a feminist argument or problem, just as Ruth Bader Ginsburg — before she was appointed to the bench — used male plaintiffs to illustrate how problematically our society categorizes and determines sex roles.

The second main feminist argument in the movie comes from the middle-aged mother, when she talks about the challenges of being a woman amidst contradictory expectations that leave little room — or no room — for actual ordinary variation in experience and behavior from day to day.

In my view, the feminine mystique argument is about patriarchy, though the film does not make that claim. Of course, the film cannot make that claim, because the Kens do not live in a patriarchal world when they feel the effects of this dynamic. Yes, “patriarchy” is mentioned many times in the film, and something called “patriarchy” is a major element of the narrative. But this feminine mystique issue is not a part of any so-called “patriarchy” in the film, even though in the real world it is entirely a function of pervasive patriarchal expectations for social structures and relationships. 

On the other hand, while I think that the mother’s explanations of problematic dynamics around dichotomous expectations for women are both powerful and importantly feminist, I think that it is a stretch to say that they are just about patriarchy. To claim that they all are rooted in patriarchy is to claim that all (social and interior) problems suffered by women are due to patriarchy. I believe that society’s subtle and overt, rare and pervasive, internalized and social subversion and undermining of women has broader roots than simply patriarchy, and it is too easy — to the point of intellectual dishonesty — to put them all at the feet of patriarchy. And this means that this second major feminist argument is actually not about actual patriarchy — despite how it is positioned in the film.

I walked out of the movie theater trying to figure out how anyone could seriously object to its political content and messaging, particularly from the right. I suppose people who feel and say that patriarchy is good — and who use the word “patriarchy” to do so — should be offended by the film. After all, the film’s depiction of “patriarchy” does not address anything about patriarchy that they would defend. Certainly, the vapid Kens (spoiler alert!) do not come to any deep understanding of the nature or power of patriarchy. The only intelligent response from the right that I can imagine is, “If that’s what you mean by ‘patriarchy,’ I can see why you object to it. We object to that, too." As I have thought more (and hopefully deeper), I have realized that the film’s themes’ and ideas’ relationship to patriarchy is even more complicated than that.

For example — to spoil a joke, though not anything about plot or character — the idea that love of horses is particularly a masculine or patriarchal thing is obviously false. Girls love horses, too. In our society, horses might even be stereotypically more a girl interest than a boy interest — and they certainly were growing up in my family.

Thus, while the Barbie movie — a fun film full of many funny moments — has a lot to say about feminism, certain ideas of masculinity, self-identity, Mattel, consumer markets, patriarchy, the challenges of being a woman (or girl) and the nature (and significance) of Barbie, it quite often masks — or even mislabels — what it is saying. Sometimes, it is explicit and on the nose. And sometimes, it is more subversively subtle than that. It twists these two main feminist arguments into an interesting plot and a — for me — politically satisfying resolution.

Barbie covers a lot of ground, and it should not be taken at face value or trusted to be presenting everything as obviously as it presents some things.


Authenticity in Large Scale Standardized Assessment

Shocking though it may seem, content development professionals (CDPs) value authenticity in test items. Unfortunately, there are many constraints on test development that prevent CDPs and other test developers from prioritizing authenticity as they might otherwise wish. For example, the simple fact of limited “seat time” (i.e., how long a test can be) can be an obstacle to including many aspects of authentic work on such assessments.

But authenticity actually plays out quite differently in different content areas. In all cases, authenticity can increase test taker engagement. In some content areas, authenticity can also help tests to get at the actual standards that both instruction and assessment are supposed to target.

A major problem for ELA test development is the use of commissioned work for reading passages. This can solve many problems, but it often creates a real authenticity problem. Passages written specifically for assessment lack an authentic purpose. For example, letters to the editor are not written to real editors or by real concerned readers. Little commissioned stories — or supposed excerpts from longer works — do not generally have the same quality as those that are published by commercial publishers.

Because so much of ELA instruction — and even more of ELA assessment — is aimed at students’ skills at reading and understanding the kinds of texts that they may encounter in their lives, authenticity in ELA assessment often comes down to text selection. Authentic texts are simply critical to authenticity in ELA. Items that ask questions that readers might naturally have about those texts also contribute to authenticity.

Authenticity in science assessment is quite different. The kinds of writing about science that naturally exist in the world are rarely at an appropriate grade level for large-scale assessment, and may include many issues and topics that extend beyond any one grade level’s standards. Authentic writing about science is rarely appropriate for inclusion in large-scale standardized assessment.

Nonetheless, authenticity is prized in science assessment. It just takes a different form than in ELA. In science, authenticity comes in the nature of the phenomenon at the focus of the item or scenario set. When it feels like it comes from test takers’ own world and lives, the phenomenon contributes authenticity. When items ask the kinds of questions that a real person — or perhaps even a scientifically minded person — might have about the phenomenon, they contribute authenticity.

This is not to say that relevant topics do not contribute to authenticity in ELA assessments. Certainly they do. But even less relevant topics can feel authentic when an author breathes the life into them that we see in great writing. But in science assessment, it is the topic — the phenomenon — that primarily drives authenticity, rather than its presentation.

Art, Science or Professional Practice?

We hate the formulation that some things are more art than science, and every one of its variants. We think that they miss the nature of art, of science and of the practice being commented upon.

Half of the idea of this formulation is that there are some fields that are predetermined and objective. These are fields with rules — even laws — that render every decision a technical decision with a definitive answer. Those are said to be science.

The other half of the idea in this formulation is that there are some fields that are creative and free, without standards or rules. They are full of personal judgment and preference, instead. Those are said to be art.

We think that this is bullshit.

First, that is most definitely not what science is. Science is a process of inquiry, of trying to understand the natural world, of knowledge creation. This process is creative, involves jumps of intuition, doubt, verification, correction, and iteration. The things that this scientific process has taught us are uncertain and always subject to further refinement — and at times even just replacement. Science never gives certain answers, and the authentic scientific process is full of subjective decision-making.

Second, the effort to liken something to art is often a backhanded compliment — at best. It denies the existence of expertise, the importance of learning, and the very idea that the product of the work is appropriate to examine for effectiveness. It denies the possibility of real advancement.

We far prefer recognizing when fields are full of professional judgment. In our view, this means employing the tools from a diverse toolbox towards a particular purpose. Yes, it involves weighing multiple factors without a single clear definitive formula. And yes, it allows for multiple paths to a high quality product. It is not as objective and predetermined as that fantasy of science supposes; instead, it involves the complexity and judgment of actual science.

Art does not have clear standards, or clear goals, or even a definitive audience. Professional practice has clear goals, clear purpose, and standards of quality — even if they are not definitive — that can be recognized by other professionals. Professional practice should become more effective at achieving its goals over time — both the practice of an individual professional and the capabilities of the larger profession. This is fundamentally unlike art, which does not have clear goals or purpose and is more free to evolve in unpredictable directions, directions which do not necessarily constitute advancement so much as shifting.

In fact, creativity exists in many professions, including science and engineering. Creativity is not at odds with the scientific process or scientific advancement. Creative problem-solving is a part of engineering and a part of mathematics.

We certainly deny that item development is more art than science, or as much art as science. These ideas deny that content development professionals develop professional judgment, that the field has standards, or that its products can be evaluated by CDPs for effectiveness. Instead, they insist that everything is really more of an opinion than anything else.

Content development work, like most every profession, relies on professional judgment. Professional practice is the application of that judgment in applying the knowledge and tools of the profession, towards the goals that that profession values.

Professional Credibility of Content Development Professionals

One of the earliest motivations of the Rigorous Test Development Project was the disparity in standing between psychometricians and content development professionals (CDPs). The people who edit and refine the items that appear on tests simply did not have the same seat at the table as the people who did the statistical analysis of test results.

Thus, an original goal of the RTD project has been to raise the standing of CDPs – as they are the ones who work on item validity, a necessary foundation for test validity. (Note: CDPs are not just item writers, who write initial drafts of items. CDP work is much more complex and extended than mere item drafting.)

The most respected professions combine copious technical knowledge with well honed professional judgment. Aspirants spend years studying their discipline before engaging in some form of apprenticeship, in which they gradually build their own professional judgment as they observe, and are supervised by, more experienced practitioners.

Less respected professions are not recognized as having the same quantity and difficulty of technical knowledge, nor requiring the same degree of professional judgment, in order to be practiced at a high-level. Thus, respect is often a function of perception and disrespect a function of ignorance.

Professional certification exams exist in many many fields, and yet everyone we have ever met who has taken such an exam complains that it fails to assess what it really takes to perform the work. We believe these exams tend to focus too much on technical knowledge (often taken out of authentically complex and interconnected contexts). Thus, they fail to address the higher level skills and professional judgment that are the mark of true expertise — or even successful practice.

There are bad lawyers, bad doctors, bad architects, bad beauticians, bad plumbers, and bad project managers. In many of these fields, one cannot even be a bad practitioner without being professionally certified. Licensed professional certification does not ensure quality, and it certainly does not ensure respect for the profession.

In our view, there are two things that truly distinguish the work of psychometricians from that of content development professionals and that contribute most strongly to these disparate standings.

First, psychometricians study their angle(s) on assessment in school and obtain advanced degrees in their field. They can earn master’s degrees and PhDs in measurement. Content development professionals may or may not have advanced degrees, but when they do, those degrees are not in assessment. (They may be in teaching or in their subject matter field.) Moreover, not every CDP holds an advanced degree.

Second, many people have more respect for STEM fields than other fields. They perhaps perceive those fields as harder than other fields – confusing the fact that many STEM problems have obviously and definitively correct and/or incorrect answers with the ideas of rigor and difficulty. They seem not to understand the difficulty of coming to a truly high-quality result when it is not always easily obvious whether the result is correct or incorrect. The need to balance multiple criteria to find a truly good — or even great — answer is not always as respected as the kind of technical work that is built on predetermined routines and literal formulas, if those routines and formulas involve lots of numbers.

So, what can we do about the lack of standing of CDPs?

Professional licensure is no guarantee of respect from those with PhDs or degrees from the most respected professional schools. Certification exams do little to ensure respect or high-quality work. (In fact, regulation is usually about providing a floor for quality, rather than raising the ceiling on performance.) Of course, that presumes that there is someone who could establish and perhaps require such licensure, which is questionable in the first place.

Our efforts in the RTD project are focused on codifying the kinds of knowledge, skills and decision making that CDPs engage with every day, both the assessment-specific examples and the ways that subject matter expertise and knowledge of the perspectives of test takers inform the assessment work. We can envision this body of knowledge becoming a formal course of study within educational measurement, someday. But that form of credentialing is not the goal of our Project. We are still trying to establish the body of knowledge and skills in a form that both CDPs and non-CDPs can see.

When it comes to the standing of CDPs in test development circles, we ask that non-CDP test developers take note of the variety of technical knowledge that CDPs tap into, and their use of professional judgment to balance issues that psychometrics often barely has language for.

Perhaps this way they can engage the kind of professional humility that is so important for learning and for collaborative work, and will better understand what their CDP colleagues can bring to the table.

Undermining Reading Standards


As a former high school English teacher, I care deeply about teaching reading skills. This goes beyond mere phonics and decoding. At the high school level, we focus more on the content of the text. This can include implications and layers of meaning. It can include quite dense information. Highly proficient readers take many different issues into account.

The reading standards differentiate reading informational texts from reading literary texts. The basic idea is that reading a novel or short story is quite different than reading a newspaper article or a text book. I am not sure that the reading standards quite get at the distinctions as I might wish, but there truly are important differences.

As we have been working on understanding the cognitive complexity of reading items on tests, we have faced a truth that makes each of us a bit upset: test prep encourages reading literary texts as though they are informational texts.

Literary texts should generally be read and enjoyed linearly, from beginning to end. They are usually narratives, and while their structure might push a reader to refer back to earlier pages from time to time, they generally move forward. On the other hand, reading informational texts is often much more purposeful. The reader of a literary text should trust the author to take them where the author wishes; the reader of informational texts might bring much more purpose to the task. They might be looking for particular information.

Therefore, authentic reading of informational texts can be much more strategic and intentional than reading of literary texts. But so many test questions about literary passages turn those passages into informational texts. The reader is looking for the answer to a particular question — about some detail or the meaning of a particular excerpt — rather than taking in and making meaning from the text more organically. This means that students can be encouraged to apply strategies of informational reading to what should be literary reading.

I am not sure how to get around this problem. Certainly, I am no fan of such directed and often simplistic items — even the best versions of those items. Authentic reading and appreciation of literary texts — be they fiction or non-fiction — should not be picayune in its focus or goals.

Certainly, the emphasis on multiple choice items — generally for their speed and scoring economy — is a problem. In fact, they might be incompatible with this sort of reading (and therefore reading standards).