Difficulty and Rigor

“Difficult” is a problematic word in the world of teaching, learning, and assessment. It refers to many related ideas, but because they all share the same word, people often conflate them. (I just learned this year that this is called a jingle fallacy.)

As a former high school teacher, I can think of a whole bunch of different meanings that I might have intended from time to time.

  • Something is difficult if it takes a long time to do.

  • Something is difficult if it is unlikely to be successful.

  • Something is difficult if most people—or at least very many people—are likely to fail at it.

  • Something is difficult if I, as a teacher, have to spend more time teaching it in order for my students to develop proficiency.

  • Something is difficult if it marks such a change from earlier lessons that I need to work hard to teach my students to adopt a different mindset for it than they’ve had before.

  • Something is difficult if it is easily mistaken for something else, and therefore likely to be attempted with the wrong tools.

  • Something is difficult if a person simply does not have any experience with it.

  • Something is difficult if someone has never had to do it themselves before.

  • Something is difficult if few people will ever be able to do it.

  • Something is difficult if the path to gaining proficiency is quite long.

  • Something is difficult if precursors are not taken care of.

This incomplete list contains many different ideas, some of which overlap, some of which address radically different aspects of difficulty than others. Some of them might be viewed as contributors to difficulty and some as different conceptions of difficulty.

Large scale assessment (LSA) has a very particular idea of difficulty. In this context, difficulty is measured empirically. It has nothing to do with teaching effort or learning paths. Rather, it is simply the share of test takers who responded to an item correctly. Concepts and lessons do not have difficulty, just individual items. Because seemingly minor alterations to an item can radically alter how many test takers answer successfully, difficulty must be measured through field testing and monitored through operational use of an item.

In a teaching and learning context, however, this empirical difficulty is not actually a fixed attribute of an item. It is the result of the efforts of teachers and students. When teachers spend more time on some difficult content or lesson—or perhaps come up with a great new way to teach it—students can become more successful learners and then more successful test takers.

Ross Markle explains that educational (and other) interventions can undermine the psychometric properties of tests (and items), including when those interventions are prompted by those tests. An item might be quite difficult (empirically) one year, but because teachers respond with new efforts, test takers might be much more successful the next year. Other items like it might prove empirically much more difficult in later years, perhaps to the surprise of test developers.

Dan Koretz has long pointed to how teachers respond to test items by altering their teaching to better prepare students for precisely those approaches to assessing content. Alternatively, one reasonable use of LSAs is to evaluate whether curricula support the development of targeted proficiencies. Thus, this kind of feedback loop can undermine teaching and learning, and it can also support them.

(Of course, all of this violates psychometric assumptions of unidimensionality, but that’s neither here nor there.)

From time to time, there is a great deal of talk about needing more rigor in school, and especially in assessment. That’s the word that people use, “rigor.” But we think that people really mean difficulty. And we think they mean empirical difficulty. They want harder tests that produce more failing results. They appear to mean that a better test is one that produces lower scores.

We think that that is garbage. We think that a better test is one that better reflects test takers’ actual proficiencies with elements of the domain model—such as the state learning standards for that subject and grade.

And frankly, we do not think that rigor is about empirical difficulty or conceptual difficulty. We do not approve of using “rigor” as a euphemism for “hard.” Rather, we think that rigorousness is something like:

  • extremely thorough, exhaustive, or accurate;

  • strictly applied or adhered to;

  • severely exact or accurate; precise.

Perhaps large scale assessments should be rigorous. They certainly should be accurate. We would prefer that they be exhaustive, but we do not see that they are given enough testing time and opportunities to do that. Still, those seem like reasonable goals.

Their empirical difficulty should be driven by the nature of the content they are supposed to reflect and by the teaching and learning of that content. It should not be a goal of the vindictive, nor should it be targeted by psychometric considerations that are not deeply grounded in issues of content, of teaching, and of learning.

Certainly, however, test development should be rigorous. Our test development processes should be demanding, and our own professional standards should be strictly adhered to. That is where we would—and do—apply the word “rigor.”