In the world of standardized assessment, item difficulty is empirically measured. That means it is not a reference to the conceptual difficulty of the KSAs (knowledge, skills, and/or abilities) that the item draws upon. Nor is it a reference to how advanced those KSAs are thought to be.
Rather, item difficulty is simply the percent of test takers (or field test takers) who answered the item successfully. The math is a little more complicated for polytomously scored items (i.e., items for which test takers can receive partial credit), but the same basic concept holds. The statistic, p, is simply the percent of available points that test takers earn across the population.
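To make the arithmetic concrete, here is a minimal sketch in Python. The function name and the score values are purely illustrative, not drawn from any particular testing program; the point is only that p reduces to points earned divided by points available.

```python
# A minimal sketch of the p-value calculation described above.
# One score per test taker for a single item; names and numbers are illustrative.

def p_value(item_scores, max_points=1):
    """Proportion of available points earned on a single item.

    For a dichotomously scored item (max_points=1, scores of 0 or 1),
    this is just the percent of test takers who answered correctly.
    For a polytomously scored item, it is the percent of available
    points earned across the population.
    """
    earned = sum(item_scores)
    available = max_points * len(item_scores)
    return earned / available

# Dichotomous item: 7 of 10 test takers answered correctly -> p = 0.70
print(p_value([1, 1, 0, 1, 0, 1, 1, 1, 0, 1]))

# Polytomous item worth 4 points: 23 of 40 available points earned -> p = 0.575
print(p_value([4, 2, 3, 1, 0, 4, 3, 2, 2, 2], max_points=4))
```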
This makes the calculation of item difficulty easy. However, it makes the meaning of item difficulty rather…ambiguous.
Imagine some chemistry or mathematics item with a low p. That is, a relatively more difficult item. What does that low p tell us?
Could it be that the item is drawn from later in the course sequence, so test takers have had less time to practice its notable KSAs and build upon them? Perhaps so late that some test takers’ classes had not covered it yet when they took the test?
Could it be that the item is from an early topic, but is a particularly nasty application of the KSAs? That is, an application requiring a greater level of skill or understanding with the KSA(s) to answer successfully, as opposed to a more routine application?
Could it be that the item combines such a variety of KSAs that it provides an unusually large number of opportunities to reveal test takers’ potential misunderstandings? That is, different test takers might fall short due to a range of different shortcomings in their KSAs?
Could it be that the item has speededness issues? That is, the item takes longer to complete successfully than most items, leading many test takers simply (and perhaps wisely) to bail on it in order to use their time more efficiently?
Could it be a multiple choice item with a very tempting alternative choice? That is, a distractor that perfectly captures a very common misunderstanding in the targeted KSAs?
Could it be a multiple choice item with a different sort of very tempting alternative choice? That is, a distractor that perfectly captures a very common mistake that is not tied to the targeted KSAs?
Could it be a multiple choice item with yet another sort of very tempting alternative choice? That is, an unintentionally ambiguous distractor that many test takers read as a correct answer option, even though the test developers did not intend it to be correct?
Could it be a multiple choice item with the converse problem? That is, an unintentionally ambiguous key (i.e., the intended correct answer option) that many test takers read as an incorrect answer option, even though the test developers intended it to be correct?
Could it be that the item presents unusual language to many test takers? That is, an item whose instructions differ from how many teachers explain that sort of task, such that many test takers are not quite clear on what is being asked of them?
Could it be that the item has unrecognized fairness issues? That is, an item that includes some sort of construct-irrelevant and culturally based knowledge? For example, use of language that is well known to urban item developers and test takers, but not to exurban or rural test takers (e.g., bodega, bike path).
Could it be that the item targets KSAs that students often have more trouble learning or mastering? That is, the item’s low p is actually a reflection of the difficulty that students have in learning a particularly tricky or subtle lesson—something that is generally well known by teachers.
Yes, some of these explanations suggest a poor quality item. Three of them clearly describe items that should not be used, because they are bad items. Two others present debatable cases about whether they describe bad items. I believe that one of those is a bad item, but the other is a question that the client would need to settle. The other six explanations, however, do not describe bad items. Whether they are appropriate for a test is a question of expert judgment that needs to be calibrated against the intentions for the test.
(And none of these explanations is about the different topic of cognitive complexity, though it is often conflated with item difficulty.)
So, you see, measuring item difficulty empirically is not sufficient to understand the item. Like all psychometric tools, it cannot provide test takers, students, teachers, or policy-makers with the kind of information that they need to improve instruction and/or educational outcomes for students. It does not even provide information about the capabilities of test takers (i.e., aid in criterion-based reporting). Rather, it is entirely oriented towards comparing test takers to each other (i.e., to aid in norm-based reporting), with shockingly little reference to the targeted constructs.