Manipulating Item Difficulty

Psychometric guidelines specify desired levels of item difficulty. Items should not be so hard that too few test takers respond successfully, nor so easy that too few fail to respond successfully. (Yes, this view is based on empirical observations about item difficulty, that is, the share of test takers who respond successfully.)
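As a minimal sketch of that empirical definition, item difficulty can be computed as the share of test takers who respond successfully to each item. The response data and the function name `item_difficulty` below are made up for illustration.

```python
def item_difficulty(responses):
    """Return the share of test takers who responded successfully to each item.

    `responses` holds one row per test taker; each row contains
    1 (successful) or 0 (unsuccessful) for each item.
    This data layout is an assumption made for the example.
    """
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_takers for i in range(n_items)]


# Made-up responses: 5 test takers (rows), 3 items (columns).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]

difficulties = item_difficulty(responses)
print(difficulties)  # [1.0, 0.6, 0.2]
```

Note that a higher value means an empirically easier item. The first item, which everyone answers successfully, is exactly the kind of item the guidelines would discourage.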

That’s just garbage. That is, if a test is supposed to be criterion-based, that is total garbage.

I guess it’s a question of what tests are for and what they are supposed to measure. If the goal is to sort or rank test takers, there simply is not a lot of information in very easy or very hard items. Sure, there is some, but not as much as in items of moderate difficulty.

But certification/licensure exams and most educational tests are criterion-based, not norm-based. If the goal is to measure what test takers know (and don’t know) and can do (and can’t do), then it should not matter how empirically difficult an item is. On a criterion-referenced test, very easy and very hard items offer us just as much information as items of more moderate difficulty.

Of course, this comes down to unidimensionality. If the goal of the test is a single unidimensional result, very difficult or very easy items might provide less information. But if the goal is to report on the many standards (or other elements in the domain model) in the construct, those easy and difficult items might provide large amounts of information about their targeted cognition.
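To make the unidimensional case concrete, here is a rough sketch that assumes a Rasch (one-parameter logistic) model. This post does not name a scoring model; Rasch is simply one common unidimensional choice. Under that assumption, the information an item provides about a test taker at ability `theta` is P * (1 - P), where P is the probability of a successful response. The function names and difficulty values are illustrative only.

```python
import math


def p_success(theta, b):
    """Rasch model: probability that a test taker at ability `theta`
    responds successfully to an item of difficulty `b` (same logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at ability `theta`: P * (1 - P)."""
    p = p_success(theta, b)
    return p * (1.0 - p)


theta = 0.0  # a typical test taker, assumed to sit at the center of the scale
for label, b in [("very easy", -3.0), ("moderate", 0.0), ("very hard", 3.0)]:
    print(f"{label:>9}: P = {p_success(theta, b):.3f}, "
          f"information = {item_information(theta, b):.3f}")
```

For a typical test taker, the moderate item reaches the model's maximum information of 0.25, while the very easy and very hard items carry much less. That is a statement about locating someone on a single scale. It says nothing about how much each item reveals about its own targeted cognition.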

Yeah, the idea that items should not be too easy or too difficult comes from, again, assumptions of unidimensionality.

Furthermore, there are other huge problems with this sort of thinking about item difficulty. It makes some sense to define item difficulty empirically, but that actually shifts attention away from the deeper meaning. Some ideas and lessons are more difficult to teach and learn, and therefore get more attention from teachers and more practice from students. Other ideas and lessons are easier to teach and learn, and therefore get less time and attention. That difference in attention to teaching and learning moderates the difference in difficulty when empirical definitions are used. So, what does difficulty actually mean, in terms of cognition?

Of course, it doesn’t stop there. The quality of instruction (as opposed to the quantity discussed above) also affects learning and subsequent performance. The best teachers might even combine higher quality instruction with higher quantity instruction for some lessons, perhaps because of how difficult the content is and perhaps because of how important they think the content is.

I got just one question wrong on my Chemistry SAT II subject test back in the dark ages. It was about what substance is purple when in a solution in water. I asked my chemistry teacher about this the next Monday and she knew the answer immediately; she just did not think it was important enough to spend time on in class. KMnO4? And 35 years later, I can’t really disagree with her. But it is not like this is a difficult idea or lesson. It’s trivial. It’s low cognitive complexity. It’s just memorized knowledge, and a small amount of memorization at that.

So, if this question had never appeared on an important test before, it would likely not be taught in many chemistry classes. But if it appeared on every important chemistry test, it would be taught everywhere. Would that change the difficulty of the concept? As it was taught more, should the item be manipulated to maintain its difficulty? For example, the word purple might be replaced with the word aubergine. Would it be a higher quality item when altered to increase empirical difficulty?

Items should elicit evidence of the targeted cognition for the range of typical test takers, ideally without Type I or Type II errors (i.e., false positives and false negatives). Their difficulty should be determined by the difficulty of the targeted cognition and the quality/quantity of the teaching and learning about the targeted cognition. If the goal is to report on learning, knowledge, abilities and/or skills, there could well be some items that everyone responds to successfully and some that no one responds to successfully. That should not be a problem.

And if the psychometric scoring models say otherwise, they should be replaced.