The Unacknowledged Tyranny of the Platinum Standard

I just got back from an educational research conference, and as is my wont, I had a lot of conversations about assessment and educational measurement.

On the morning of the last day of the conference, as people were saying their goodbyes, I found myself in conversation with a brilliant young psychometrician still on her first job in industry. I was pushing her to consider examining the application of some sort of multi-dimensional psychometric model when she got a chance to do her next little research project. She was concerned that that might mark her as being a little weird, as the industry is so heavily invested in unidimensional psychometric models. She pulled in Yu Bao, a professor in James Madison University’s Assessment & Measurement program, who was walking by. Yu agreed with me that there’s a lot of room for a psychometrician to make their name with multi-dimensional models.

I went on one of my typical rants about the mismatch between unidimensional psychometric models and multi-dimensional domain models, and about the platinum standard. That is, the way that psychometricians bring model fit statistics to data review meetings and strongly suggest that items with poor model fit—poor fit to the misapplied unidimensional model—be removed. (They do this with item difficulty statistics, too, but that is not as bad for validity claims as this use of model fit statistics from inappropriate models.)

This young psychometrician pushed back, however. She said that psychometrics uses unidimensional models because they fit the data better.

But that’s not true. That’s not true in practice and that is not true at research conferences. Just the previous day, a colleague of mine told me about a session that he attended—and walked out of. There, a young psychometrician was explaining the use of factor modeling techniques to something something something—I didn’t attend that session, so I do not know what he was trying to do. He showed that item 31 did not fit his model, so he removed it. He did not remove it because it was not well aligned to the assessment target or larger macro-construct. He never even looked at the actual item itself. Rather, he removed the item simply because it did not fit the psychometric model he was using.

No consideration for the construct’s theorized model. No consideration of the formal domain model. Only the psychometric model.

My colleague was walking by, so I pulled him into the conversation. Yu agreed that this happens sometimes, despite what the brilliant young psychometrician had been taught and expected to see.

My colleague and I know that this happens quite a bit. Psychometricians come with their opaque techniques and intimidatingly precise numbers. Few people outside of psychometrics have the confidence to push back against people armed with something they do not understand, and that precision—and all of its decimal places—is so easily mistaken for accuracy.

The platinum standard is powerful. It shapes our assessments, and not for the better. It leads to the removal of possibly the best aligned items, simply because they do not accord with the demands of psychometrically simpler tools and their rather unrealistic assumptions. The platinum standard forces those inappropriate assumptions on the entire field, requiring those who actually focus on the domain model and content alignment to simply accept the demands of psychometrics.

Content Development Professionals Require Different Expertise than Teachers

Because assessment is an important part of teaching, it is not surprising that content development professionals (CDPs) for large scale assessment (e.g., standardized testing)—the professionals who develop and refine the contents of tests—require many of the same skills, knowledge and expertise as teachers. However, CDPs also need other skills and knowledge to work at a high level.

Unlike teachers, CDPs do not have to worry about classroom management or lesson plan construction. But they do need a teacher’s understanding of the content area and how to think about content and learning paths. Of course, both teachers and CDPs need to understand the cognition of others without unwittingly projecting their own thinking onto them.

However, CDPs need to think about these things a bit differently than teachers. Large scale assessment does not have as many opportunities to assess students as classroom practice, so it must do so much more efficiently. Teachers can triangulate lots of different information from and about students to figure out whether they understand something, but large scale assessment usually depends on a single assessment to make that inference. Therefore, CDPs need a much more precise view of evidence than teachers do. They need to be able to recognize the ambiguity of evidence so that they can create test items that elicit more definitive evidence of a test taker’s level of proficiency.

While teachers often focus on how to integrate the knowledge, skills and abilities found in various learning standards into larger lessons and activities, CDPs need to understand how to isolate them while still preserving some amount of authenticity. They need to be particularly mindful of the kinds of mistakes that learners make and how they relate to particular learning goals—recognizing their connection to the targeted cognition of an item.  

Like teachers, CDPs need to understand how others think—often others very different than themselves. Teachers have their students in front of them, and can learn more about them over time. CDPs have to imagine test takers, rather than being reminded of them every day by their presence. Moreover, the range of diversity and dimensions of diversity are vastly larger with large scale assessment than a single teacher in a single school must account for. We call this radical empathy because of the amount of variation in background, experience and perspective that CDPs must be able to anticipate. 

CDPs also require technical knowledge and skills that teachers do not. CDPs need to know the differences between different item types, how they work, and which are most appropriate to elicit evidence of different sorts of cognition. They need to be able to recognize problems in a multiple choice item and how to make it better elicit evidence of the targeted cognition for the range of typical test takers. The fact is that it is incredibly difficult to create high quality multiple choice items that produce high quality evidence, a fact that makes it more important to take those challenges seriously. 

They need to understand the workflows, contributors and collaborations that comprise the test development process. Moreover, they need to have the ability to push back against the various pressures to alter items in ways that compromise their ability to elicit evidence of that targeted cognition for the range of typical test takers—or even omit them entirely from an assessment. Of course, all of this requires understanding the values and thinking of the many different disciplines that contribute to these collaborations.

I would never suggest that CDP work is more difficult or complex than teacher work; clearly it is not. Working with children—of any age—and being sensitive to their needs is enormously challenging and complex work, made more so by the official and unofficial learning goals. All of those challenges are magnified exponentially by the reality of how many children are there at the same time.

However, the work of developing the contents of standardized tests is itself complex and difficult, mostly in ways unappreciated by the public—and even by others involved in large scale assessment. It leans on many areas of skill and knowledge that overlap with teachers’ expertise, but it has different goals and constraints. Therefore, it also requires different expertise—including, but not limited to—expertise in the content area.

Content Development Work Requires Far More than Just Content Expertise

There is a cynical and incredibly foolish expression, “Those who can, do; those who can’t, teach.” Yes, it is grounded in the idea that many teachers are not the deepest content experts. However, it is foolish for two reasons. First, it entirely misses the fact that good teaching requires its own set of skills—skills that mere content experts usually lack. Second, and perhaps less obviously, it misses the fact that teachers must have a different relationship to the content than mere practitioners—even those at the highest levels of expertise, who practice with the most nuance, skill and wisdom.

Teaching requires thinking about the content, holding it at arm’s length, rather than just using it. Some (e.g., teachers) call this meta-cognition: thinking about thinking. Being able to do does not require consciously understanding what it is you are doing or being able to communicate it to others. In fact, that kind of thinking can get in the way of fluid, skillful practice. Nor does doing require understanding how others might perform the skill. Teachers, however, have to understand the kinds of mistakes that learners make, and the different learning paths towards proficiency.

Content development work for large scale assessment (e.g., standardized testing)—the work of crafting and refining the contents of tests—requires many of the same skills as teaching. It requires thinking about the content. Like teachers, content development professionals (CDPs) need to understand how others understand the content, and the ways in which they might misunderstand or misapply it. They need to be able to recognize their own thinking and learning path, but not be so self-centered as to assume that it is the only learning path. Like good teachers, they need the empathy to imagine the cognitive paths of others—including those with vastly different backgrounds and experiences. 

Yes, and like teachers, CDPs need content expertise. And like teachers, they need far deeper content expertise than most people realize. They need to understand how the content fits together and how it is applied in practice. They rarely have the fluidity of a practitioner’s mastery at the highest level of professional practice, but they have deep understanding of content, nonetheless. 

And they also need much of the expertise of teachers, in addition to their own particular skills, knowledge and abilities.

Has IRT Destroyed Old School Validity?

When I first learned about the measurement idea of validity, I was taught that it is about whether the measure actually measures the construct. I was taught that validity pairs with reliability, which is about how consistent the measure is. Reliability is like the margin of error from one measurement to the next, but validity is whether you’re aiming at the right target. I have had this definition in my head for…decades? I think I first learned about this in a psychology class in the 1980s.

When I came to the field of educational measurement this century, I found a somewhat different definition. Since the 1999 Standards for Educational and Psychological Testing (AERA, APA, NCME), in educational measurement validity is about whether there is evidence to support the use of a test or measurement for a particular use. We all stress that validity is no longer a property of the test itself, but rather a property of the particular test use. And there are different kinds of evidence that can contribute to this idea of test validity.

That shift to attention on particular test uses is really important. When tests are repurposed, they might no longer be appropriate. For example, a very good test of 4th grade mathematics is simply not a good test of 6th grade mathematics. It is not that the test has changed, but rather that its use has changed. So, the old validity evidence for the old use is no longer germane. 

I buy that. I really do. But I still have in my head the issue of the basic inference. That is, totally apart from test use, does the test actually measure what it claims to measure? Are the inferences we make about test takers…valid? I think that that still matters.

In fact, I think that whether the tests measure what they are supposed to measure is the real point. I think that that old school idea of validity as simply the question about whether the test measures what it is supposed to measure is critically important. If it does, then appropriate uses are kinda obvious. And inappropriate uses are also kinda obvious.

So why the shift from the 1985 Standards to the 1999 Standards?

I have a theory that is probably incorrect. But it’s in my head.

For decades, the statistics behind big standardized tests have been based on something called IRT (item response theory), and before that on CTT (classical test theory). Each of these generally reports a single score that is useful for sorting and ranking test takers. No matter how many different elements the test is supposed to measure—like the different standards in a domain model—they each report a single unified score. However, for them to work reliably, test developers remove potential test items that seem like they might be measuring something a little different than the other items. So, the better that each item does at measuring its targeted standard, the less likely that item is to be included. The more an item instead kinda measures some muddled middle idea of the construct, the more likely it is to be selected. Psychometricians call that “model fit,” and the model is usually unidimensional IRT or CTT.

When there is a conflict between a multi-dimensional domain model (e.g., the different knowledge, skills and abilities that go into a set of standards) and a unidimensional psychometric model, modern educational measurement favors the unidimensional model—throwing aside items that better fit the domain model than the psychometric model.

As a content person, I have never been able to figure out what that mushy middle means. On a 4th grade math test, it’s some vague idea of math ability…but it’s not clear which aspects of math ability factor in and which do not. It might include ability to learn math. But how much? It might include computational accuracy. But how much? It might include problem solving ability, but how much? Or even reading ability! Because model fit statistics lead to the removal of really strongly targeted items (i.e., as opposed to items that lean towards the muddled middle), I don’t think we could ever know.

These techniques produce a seemingly definitive ranking of test takers with seemingly definitive quantitative scores—often to many decimal places. But it is never clear what they are ranked on. Something about math…but what? They most definitely are not a thoughtfully weighted composite score when IRT is combined with item selection and model fit statistics.
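To make this concrete, here is a toy simulation—not any operational psychometric procedure. The two dimensions, the loading pattern, and the use of item-total correlation as a crude stand-in for model fit are all assumptions for illustration. It sketches how a unidimensional filter tends to flag exactly the items that measure a second dimension:

```python
import numpy as np

rng = np.random.default_rng(7)
n_students, n_items = 5000, 20

# Two uncorrelated ability dimensions (say, computation vs. problem solving).
theta = rng.standard_normal((n_students, 2))

# Items 0-15 load on dimension 1; items 16-19 load on dimension 2.
loadings = np.zeros((n_items, 2))
loadings[:16, 0] = 1.0
loadings[16:, 1] = 1.0

# Simple two-dimensional response model: P(correct) rises with the loaded ability.
logits = theta @ loadings.T
responses = (rng.random((n_students, n_items)) < 1 / (1 + np.exp(-logits))).astype(int)

# A crude stand-in for unidimensional "fit": each item's correlation with the
# total score -- a total dominated by the sixteen dimension-1 items.
total = responses.sum(axis=1)
item_total_r = np.array([np.corrcoef(responses[:, j], total)[0, 1]
                         for j in range(n_items)])

print("mean item-total r, dimension-1 items:", item_total_r[:16].mean())
print("mean item-total r, dimension-2 items:", item_total_r[16:].mean())
```

The dimension-2 items are every bit as well targeted to their own standard, but they correlate less with the majority-dimension total, so a fit-based filter would nominate them for removal first.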

Which takes me back to the question of old school validity vs. new school educational measurement test validity. Was the change necessary simply because we never know what IRT is scoring students on, from a content perspective? That is, IRT results are not interpretable through the lens of the construct, so we no longer focus on the inference?

That’s what I am thinking about, these days.

Are we measuring the right construct?


Imagine that you are in a kitchen and need to measure the volume of some odd solid object, or the difference in volumes between two odd solid objects. But the only real measuring tools are scales (i.e., a kitchen scale and a bathroom scale) and any number of household tape measures, rulers and yard/meter sticks. And the internet is down.

* One approach might be to simply take the mass of the object(s) and figure that most things have a density of around 1 g/cm3, and go with that. If you need the difference, take the difference.

* Another approach might be to do that Archimedes thing and try displacement. Fill up a cup or larger container to the brim with water, drop the object in the cup and catch all the water that the new object forces out of the cup. That would take a saucer (or serving platter) under the vessel to catch the water. Measure the mass of that saucer (or serving platter) empty and with the water. Eureka! The difference is the volume, so long as you convert the units, right? So clever, that Archimedes.

* The third, and hardest approach would be very much like the second approach, but it departs from the Archimedes version, because these objects are not gold crowns. You’d need to push the object down into the water, making sure that it is entirely submerged—but without putting anything else in the water. Either push it JUST under the water, or use some very very fine tools to hold it down further. Again, calculate the mass of the displaced water and convert the units. That’s the volume of the object; just subtract the smaller volume from the larger if comparing two objects.

The third approach is way more clever than the first two, and is the only one that will actually give you volume. The first approach approximates volumes, but will not work for objects that easily float or sink—signaling a density significantly different than water’s. The first approach just gives you mass. The second approach will work for denser objects, which do entirely submerge in the water, but not for objects that float (i.e., are not entirely submerged). For the former, yes volume. But for the latter it just gives mass again. Not actually as clever as we thought. 
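The unit conversion in these approaches is the trivial part; the trap is in what the displaced water actually tells you. A minimal sketch (the 180 g figure and the helper name are made up for illustration):

```python
# Density of water is roughly 1.00 g per cm^3 at kitchen temperatures.
WATER_DENSITY_G_PER_CM3 = 1.00

def displaced_volume_cm3(displaced_water_g: float) -> float:
    """Convert the mass of the caught water to the volume it occupied."""
    return displaced_water_g / WATER_DENSITY_G_PER_CM3

# A fully submerged object (third approach) displaces its own VOLUME of water:
print(displaced_volume_cm3(180.0))  # 180 g of caught water -> 180 cm^3

# A floating object (second approach) displaces its own MASS of water instead,
# so the very same arithmetic silently reports mass, not volume. The formula
# cannot tell you which situation you are in -- only the physics can.
```

The calculation is identical either way; whether the number means "volume" depends entirely on whether the object was fully submerged.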

(Archimedes’s experiment was a bit different, and he had a whole bunch of spare gold lying around. Neither you nor I have that available for our work.)

I have no doubt that there are many people who think that psychometrics is analogous to the third approach. That it really is clever enough to take the products of limited tools to measure difficult constructs. But what I see is a dependence on limited tools that simply measure something different than the intended construct. And, no, the analysis is not so clever as to successfully convert the results to the intended construct. Disturbingly, it is not that adequate tools are unavailable; rather, it’s the insistence on using unidimensional psychometric models and filters to measure multi-dimensional constructs. There are other models; they just are not favored. Perhaps they are not as easy to use. Perhaps they don’t have an established place in curricula and/or practice. Perhaps it is simply that if we’ve always tended to use a hammer, we tend to redefine problems into problems that can be solved with a hammer.

But the charge of testing is to measure the intended construct, not some other construct that our favored tools are better at measuring. 


The Worst Reasons to Reject Change?

Ages ago, when I was in grad school, I learned from Susan Moore Johnson that many people incorrectly cite union contracts as a reason why something must be done, why a practice cannot be altered or an innovation cannot be picked up. As she explained it, people do not actually read the contract, so rumors about what is in it are often even more powerful than the contract itself. This was eye-opening to me at the time, and it has stuck with me. But I think that it was simply too polite an interpretation. Yes, that is often the case. But I think that sometimes—perhaps even more often—it is a willful ignorance. Some people do not care what is in the contract, and are simply invested in arguing against change—regardless of whether their arguments are made in good faith.

Whether the argument that something is in the contract is made in good faith is itself a contentious question. So, we can put that aside. Regardless of whether it is made in good faith, it is often an erroneous argument used to push back against those calling for innovation or change.

I have come across this exact same tactic in other contexts as well. 

* About 10 or 15 years ago I was trying to report a bug to Apple in some piece of their software. The level II support specialist I was speaking with went up to a level III support specialist and came back to me with an excuse. The way I was using the software, he explained, violated the end user license agreement. It wasn’t a bug, you see, it was a misuse. But I knew that couldn’t be the case, so I opened up the very long end user license agreement while on the phone with him and went through it, looking for anything relevant to his point. Of course, it wasn’t there. This was a moral victory, as he had to admit that his superior’s excuse was untrue. (I do not know that it got the bug fixed any faster. I switched to a third party application and have not felt a need to go back to Apple’s app for that use.)

* Just last month, a regional chain with ~75 stores opened a brand new supermarket near me—now my nearest supermarket, just 14 minutes away. Unfortunately, there are a handful of operational mistakes that make shopping there just a little more annoying than it needs to be, and they hit me every time I go there. I have mentioned them to the "customer experience manager" and last week I saw a team of muckety mucks from headquarters going through the store to make a list of lessons and things that might be fixed. I took the opportunity to mention a couple of my concerns. As I was checking out, one of them came to me to thank me for my feedback. He said he was the head of store design for the chain. I took the opportunity to share another concern, one that would actually take a little—just a little—money to fix. While I was talking with him some assistant manager (from another store) came up to defend the chain’s honor. He started making excuses that I knew weren’t true. Eventually, he said that they couldn’t fix the problem because of the ADA. For me, that was too much.

I replied to him, “You mean that the Americans with Disabilities Act, signed into law in 1990 by President George Herbert Walker Bush (and perhaps amended since then) has a provision in it that says what side of the self-checkout machine the groceries go on? I bet you $100 right now that that is not true.” It actually wasn’t the first time I had used that line about making such a $100 bet that day. Just earlier, when I was talking to another assistant manager in the store, he said that 80% of people who go to the store end up in the refrigerated section, and I knew that could not be true. (The head of store design confirmed that it wasn’t at all close to 80%, and they didn’t want it to be.) This wasn’t even the first time that they had mentioned the ADA, as that assistant manager had also tried to invoke it to push me off another point I had tried to share.

I have observed that the federally required peer review process for many large scale assessments is also an intimidating citation used to push back against innovation or improvement efforts. People claim that because a test has already gone through peer review, no processes or documentation can be improved. People claim that some innovations cannot be used or applied because they will never get through the peer review process. It’s just the same damn dynamic.

What I see over and over again is people who simply are against change, do not want to alter what they already know and are comfortable with, and are not invested in improving the product or process. But instead of going through an accurate analysis and/or giving real reasons to oppose change, they simply grasp for some powerful authority that they can claim is the immovable obstacle. They do not have to own their opposition, and they certainly do not have to think deeply about evaluating the proposal. They do not even have to have power to reject the change. Instead, they claim some expertise about that other thing that is the real obstacle.

But I know what the ADA is. I know what goes in EULAs. SMJ taught me to actually read union contracts. And I even know enough about peer review to know that it is not the obstacle that it is made out to be. It was not meant to be an obstacle to improvement, and really doesn’t have to be. 

More generally, I do not think I am ever going to get over my fury when people try to prevent change by hand waving at intimidating-seeming authorities that they do not even understand. Again, it hardly matters whether they know better, because their ignorance is willful and the citation of authorities they know little about is intentional. It is just fear of change, fear of thoughtful deliberation and an unwillingness to take responsibility for maintaining their preferred (and problematic) status quo.

What We Mean When We Talk About 'Reliability'


One of my little pet peeves is when athletes say that they need to be more consistent at something that they have long been consistently mediocre or bad at. I agree that they need to be better, but I don’t think that the problem has been consistency. Heck, a basketball player going from a 23% 3-point shooter to a 35% 3-point shooter has not even become more consistent, even though that would constitute a rather large improvement.

Words have meanings, and while I love metaphorical language, when words with rather precise meanings are expanded, our ability to express precise things is diminished. I find that frustrating—perhaps because I was raised by a lawyer and perhaps because I was such a math and science kid growing up.

But the fact is, that words can have different meanings in different contexts. This is certainly true when words have technical meanings in expert fields and also have lay meaning for the general public. 

Reliability is one of those words, and it is a very important technical word in the field of educational and psychological testing. And yet, it is also a middle school level word that refers to trustworthiness. 

In everyday use, a reliable person is someone you can trust to be there and to do the right thing. It is not just consistency, but also usefulness and worthiness. 

But the statistical term, as used in many technical fields, merely means consistency. Something can be consistently off by the same amount, and that would be reliable. Statistical reliability is only about consistency, regardless of appropriateness, precision, or actual accuracy. Under this definition, a car that only—but always—starts up when it is over 90 degrees outside is a reliable car. Moreover, it would be more reliable if it only started up when the temperature was over 100 degrees, and most reliable if it never started up at all. After all, that would be perfectly consistent—consistently useless.
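That gap between consistency and accuracy is easy to demonstrate. Here is a small simulation (all the numbers are made up for illustration) of an instrument that is consistently 25 points off: its test-retest correlation—one common operationalization of statistical reliability—is excellent, even though every score it reports is far from the truth:

```python
import numpy as np

rng = np.random.default_rng(0)

# The thing we actually want to measure.
true_score = rng.normal(70, 10, 2000)

# A biased instrument: always about 25 points low, with only a little noise.
form_a = true_score - 25 + rng.normal(0, 2, 2000)
form_b = true_score - 25 + rng.normal(0, 2, 2000)

# Test-retest "reliability": correlation between two administrations.
reliability = np.corrcoef(form_a, form_b)[0, 1]
print("reliability:", round(reliability, 2))          # very high

# But accuracy is terrible: every reported score is roughly 25 points off.
print("mean error:", round(np.mean(form_a - true_score), 1))
```

By the statistical definition, this is a highly reliable measure. By the everyday definition, no one would call it trustworthy.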

This is a particularly important gap in meaning in my field because when psychometricians insist on maximizing reliability, that sounds really good to a lay audience who does not appreciate the difference between the everyday term and the statistical term. Psychometricians want consistency, even if it means consistently the wrong thing or leaving out the most important stuff because it hurts the consistency of the test. They say they are increasing reliability, and they are not lying. Heck, I am sympathetic to their use of the term as it is also what I tend to think the term means, too. That math kid started taking advanced Probability and Statistics courses nearly 40 years ago. That meaning is what I have in my head when basketball players talk about being a more consistent shooter.

I think a lot about important technical problems with how psychometricians’ focus on that statistical idea of reliability leads to worse tests, but perhaps the bigger problem is how their different meaning of reliability misinforms policy-makers and the broader public about what they are even trying to do. The broader public and policy makers are the real audience for our tests, and we should be mindful of how they hear what we are saying. 

Not Being Mind Readers, There Are Things We Cannot Know

We need cognitive tests because we cannot read people’s minds. Instead, we have to find evidence that supports various inferences we might want to make—or that refutes them. This dilemma of not being able to read minds is not limited to testing.

Hanlon’s Razor

Hanlon’s Razor advises, “Never attribute to malice that which is adequately explained by stupidity.” But the idea goes back further. More than a century ago, H. G. Wells wrote, “There is very little deliberate wickedness in the world. The stupidity of our selfishness gives much the same results indeed, but in the ethical laboratory it shows a different nature.”

In my teen years, I thought that the worst trait a person could have was incompetence. Nope, I’ve never been fun at parties. I certainly have that history of seeing incompetence around me. But for the last few years, I have been faced with a situation outside of my professional life that I attribute to malice. Others whom I respect agree, but temper it with judgment of some amount of incompetence. Certainly, many people seem quite unwilling to see this particular form of malice.

How can I know? How can any of us know? We cannot see inside the hearts and minds of those around us, not even that one woman.

Optimizing Political Strategy

In the days immediately after Joe Biden’s disastrous performance in his 2024 debate with Donald Trump, I cautioned those around me that it would be foolish for Biden to drop out of the race before the RNC nominating convention. There would be no way to take away coverage from the RNC, and it would be wise to let Trump’s Republican Party waste its powder on attacks on Joe Biden and his age when they had maximum free coverage by the news media.

I cautioned that no one is going to remember a few weeks in July when we actually get to November. The DNC nominating convention was still over a month away, and there was not much in July or August that would matter by Election Day. Electoral campaign memories are short, perhaps unfortunately short.

I said that the optimal strategy would be to wait until…July 19 or 20 for Biden to drop out of the race. That would be weeks before the Democrats officially nominated their candidate. I didn’t want a traditional circular firing squad, and thought the best strategy would be to go with Kamala Harris—though she was not my preferred candidate in 2020.

Obviously, for this bait and switch strategy to work, there could not be leaks. Anyone who might leak anything to the media had to be ignorant of the plan. Thus, people in the know could not tell their aides—or perhaps even their spouses. For this to work, Joe Biden would have to look stubborn—even as the pressure mounted from people who did not know the plan.

I did not anticipate that Biden’s delay would build up such energy for replacement that a politician who produced so little excitement in 2019 and 2020 would be as well received as Vice President Harris’s candidacy has been. And I thought that Navy veteran and Rhodes Scholar Pete Buttigieg—an obviously conservative family man who is comfortable debating Republicans on Fox News—would be a great running mate for her.

In fact, I was off by a day. President Biden dropped out of the 2024 presidential race on July 21. But otherwise…was my prediction wrong?

How can I know? How can anyone know? This had to be a no leaks plan. It would require Biden to look like an old grandfather who absolutely refuses to give up his car keys. He would have to take the further reputational hit in order to help his party to retain the White House.

What were Joe Biden and his closest most trusted advisors thinking? Could the greatest political strategist of this century, Nancy Pelosi, have come up with this weeks ago? Could President Biden have gathered the Clintons, Obamas and her for a serious strategy session? (Not Chuck Schumer. I do not trust that he would not leak.)

How could we possibly know the truth? Even if word leaked in the months or years ahead that this was planned, why should we trust that? It would make Democratic leadership look brilliant, so there is real motivation to leak such claims after Republicans cannot do anything about it. I do not know what to conclude, and I do not know that I ever will. (Even I do not think that the plan—the conspiracy—could go further back than the debate, but how can I be sure…?)

I cannot read anyone’s mind.

The Challenge of Intellectual Humility

I really try to be intellectually humble. I try to be aware of what I think and why. I try to be conscious of what I really know and the absolute facts available to me. I try to be mindful of when I run up the ladder of inference, even if it is just a single rung.

Yes, it would be validating for me to conclude that that woman is motivated by malice, rather than mere stupidity. Yes, it would be satisfying to me to think that our political leaders are brilliant, rather than just bumbling.

But I cannot read minds. I need to live with that uncertainty. At the same time, I need to look for whatever confirming and disconfirming evidence might be available—both professionally and in the other spheres of my life.

They Are All Norm-Based Tests, Brent

Track & field’s 100m sprint is a norm-based test. Though it is not a cognitive test, it exhibits so many of the causes and symptoms that make norm-based tests so problematic. It is a test designed to rank participants by giving arbitrary weights to a collection of related skills and then claim a definitive result, in large part through the use of numbers. From the 100m sprint, we get a declaration at the Olympics of who is the World’s Fastest Man and Woman. But I just don’t buy it.

I have learned through the years that there are three main phases to the 100m sprint. First is the start, then the acceleration phase and finally…well, I see and hear it called different things. The constant speed phase. The maintain phase. Whatever. The name is not important. What is important is that different sprinters have different strengths. Sure, if you are the best in the world at all three of them you are going to win, but that is quite rarely the case. Sha'Carri Richardson is stronger at the third phase than the first phase, as was Carl Lewis.

When I was growing up, we did the 50 yard dash. The National Football League judges speed with a 40 yard dash. Indoor track has a 60m event.  International Track & Field does not use a 100 yard race, but rather a 109.36 yard race (i.e., 100m). Why these differences? Ummmm….well, one could offer different reasons to support one distance or another, but there’s no definite best answer. It is arbitrary which one we should use or respect most. However, the longer the distance, the more important that third phase is, and the shorter the distance the more important the other two phases are. 

When I was growing up, we did not get to use starting blocks. In fact, we had to begin from a standing start. Why prefer starting blocks or a standing start? There are reasons for each, even good reasons for each. One could go either way, so the decision is arbitrary.

This is no different than big math or reading tests. Math and reading are each made up of a large variety of skills. How much should the SATs or the ACTs depend on calculation skills? How much on solving algebraic equations? How much on making sense of word problems? How much on drawing graphs and how much on reading graphs? Obviously, there are more skills than that, and there’s no definitive reasoning for how we should weigh them in order to come up with a final singular score. 
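To make that arbitrariness concrete, here is a toy sketch in Python. The students, skill profiles and weights are all invented for illustration; the point is only that the same two profiles can swap rank depending on how the sub-skills are weighted.

```python
# Hypothetical skill profiles (proportions correct on each sub-skill).
students = {
    "Student A": {"calculation": 0.90, "algebra": 0.60, "word_problems": 0.50},
    "Student B": {"calculation": 0.55, "algebra": 0.70, "word_problems": 0.85},
}

def composite(profile, weights):
    """Collapse a skill profile into one score using arbitrary weights."""
    return sum(profile[skill] * w for skill, w in weights.items())

# Two equally defensible weighting schemes.
weights_1 = {"calculation": 0.6, "algebra": 0.2, "word_problems": 0.2}
weights_2 = {"calculation": 0.2, "algebra": 0.2, "word_problems": 0.6}

for name, profile in students.items():
    print(name,
          round(composite(profile, weights_1), 2),   # A: 0.76, B: 0.64
          round(composite(profile, weights_2), 2))   # A: 0.60, B: 0.76
```

Under the first weighting, Student A "is better at math"; under the second, Student B is. Neither weighting is wrong, which is exactly the problem with a singular score.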

Any test that offers a final singular score is intended to sort and rank test takers. This totally makes sense at the Olympics and other sports competitions. But it is just about useless when it comes to teaching and learning. A track coach is not going to learn anything about what to tell an athlete by looking at 100m times. It says nothing about what they are good at, what they are bad at, what mistakes they are making, or where they might most benefit from further instruction or practice.

Normative tests are good for the final competition and useless for teaching and learning. 

Break down the three phases of the race into separate times and the coach can use their expert knowledge and experience to zero in on what phase is the weakest. Allow the coach to actually see their work (i.e., watch the race) and they can break it down further and zero in on useful coaching. 

But you can’t crown the World’s Fastest Man (or Woman) if you break it down like that. Which phase counts most? Should we focus on top speed? Best time over their fastest 20m? Fastest acceleration? Should the maintain phase be like 20m, or like 70m? 

Any test that reports a singular score is meant to sort and rank students. We cannot do that with profiles of proficiency with an array of different skills, but that is a whole different kind of test. And yet, those profiles are what are useful for teachers and students, what is useful for teaching and learning. 

Unfortunately, while not all standardized tests are normative tests reporting a score of some arbitrary composite of different skills, almost all of them are. And that will always be a problem, perhaps even an obstacle to the process of education.

What is the Purpose of Educational Measurement?

Well, that’s a bad title. I mean, I know what the purpose of educational measurement is. It is to report on the status of test takers’ proficiency with particular knowledge, skills and/or abilities (i.e., KSAs), and perhaps to report on improvements in their proficiencies (i.e., learning).

And it is to do so in quantitative terms. Not all educational assessment is about quantification, but educational measurement is. I accept that.

So, what I am really wondering about is the scholarly, academic and researchy field of educational measurement. This field includes professors and others at universities, vast numbers of professionals working in industry (i.e., in both for-profit and non-profit organizations), folks working in government departments of education (i.e., local, state and federal), and even solo practitioners (like me).

I am asking about the researchy stuff. I am asking about the purpose and goals in advancing the field of educational measurement. This is the stuff of academic journals and a variety of types of conferences. No, this is not the everyday work of teaching or developing tests. Rather, this is the most creative and intellectual part of the field, where the state of the art gets created and pushed further. Where the field learns, grows and advances.

What is that learning and growth oriented towards? What is its purpose? It’s a somewhat large field, so I suppose that there can be a lot of goals, depending on the particular interests of the researchers and grant makers.

So, let me come at this from another direction.

Since the original edition of the handbook Educational Measurement (Lindquist, 1951), we have seen huge advances around the world in educational attainment and equity. Simply vast. In the United States we have seen an incredible lowering of the dropout rate, even as we have created state standards and even raised those standards.

I am not questioning the contributions of educational measurement to those advances, at least not today. Rather, I ask whether the advances in educational measurement over the last 70 years have been important to those incredible advances in educational attainment around the world.

If they have, I would love to know how. And if they have not, which I strongly suspect is the case, why not? What have 70 years of advances in the field of educational measurement been for, if not improving education for students, for communities and for nations?

I would really like to know.

Distractors Matter: Manipulating Item Difficulty with Distractors

One might think that the main determinant of a multiple choice item’s difficulty is the set of KSAs that an item is targeting. One might think that item difficulty can be spotted through an examination of the stem (i.e., the item’s question or prompt). But one would be wrong.

The most important determinant of item difficulty is the distractors (i.e., the incorrect answer options).

An item without plausible distractors is going to be an easy item. That is, an item whose distractors can all be quickly and easily dismissed—even by those without the command of the targeted cognition—will be easy. We call this low bar for plausibility shallow plausibility or surface plausibility. Distractors must at least be shallowly plausible, and yet they often are not.

An item whose distractors are all shallowly plausible and deeply plausible will be a more difficult item. Deeply plausible distractors are those that require working through the item to dismiss, because they follow from mistakes in applying that targeted cognition.

The most difficult items often have distractors that are quite similar to the key (i.e., the correct answer option). They might differ in some subtle way from their corresponding key. They might rely on a minor point in a text to differentiate. They might be a good answer option, just not the best answer option. For example, they might not be false, and yet they might not contain as much truth as the key. Therefore, they might look like a good answer to a test taker who does not check all the answer options.

None of these possibilities involve changing the targeted cognition, the stem or the key. And yet, these different sorts of distractors or distractor sets can radically alter the empirical difficulty of an item.

Heck, distractors are so powerful that they can shift the meaning of the evidence that an item collects from the targeted cognition to some other KSA(s).

Distractors Matter: Answer Option Order and Cognitive Complexity

Multiple choice (MC) items are not merely questions with a bunch of supplied answer options. They do not operate like constructed response (CR) or open-ended questions. Unfortunately, too few people understand and appreciate how important answer options are to understanding how people respond to MC items.

While the famous Haladyna, Downing and Rodriguez list of item writing guidelines says that answer options should be placed in a logical order, it does not address the impact of that order on how test takers work through an item. Sure, answer options could be ordered by length, put in alphabetical order or some sort of chronological order. But whatever rule one follows, it can have unintended impacts on the cognitive path that test takers work down to come to their choice.

Simply compare the cognitive path of placing the correct answer (i.e., the key) in the first position or in the last position.

For items such as mathematical calculations, putting the key first allows the test taker to skip all the other answer options entirely. But if the key is the last answer option, the test taker must consider (and perhaps compare to their answer) each of the other answer options before recognizing the key at last.

But if the item is perhaps less black and white, the test taker might have to try to interpret and make sense of answer options, comparing each to their own thinking. If the key is in the first position, the test taker can quickly come to a sense that it matches their answer, select it and then move on. But if the key is last, the test taker has to figure out whether each distractor (i.e., an incorrect answer option) means what they are looking for, or whether it means something else. As they move through the list, they might lean a little bit more into a “Well, does it kinda mean the same thing?” sort of thinking.

Clearly, the order of answer options can impact how long it takes test takers to work through an item, and the sort of thinking they need to do—even without changing anything about any individual answer option. Different test taker strategies can also influence these, but distractors matter.
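As a back-of-the-envelope sketch—a toy model of my own, not anything from the item writing literature—here is how key position alone changes the workload for a test taker who checks the options in order and stops at the key:

```python
def options_examined(key_position, n_options=4):
    """Options a strictly sequential test taker processes before
    selecting the key (positions are 1-indexed)."""
    return key_position

# Same item, same answer options: only the key's position differs.
print(options_examined(1))  # key first: 1 option processed
print(options_examined(4))  # key last: all 4 options processed

# Averaged over many items with the key position randomized,
# the expected workload per item is the midpoint:
average = sum(options_examined(pos) for pos in range(1, 5)) / 4
print(average)  # 2.5 options per item
```

Real test takers are not strictly sequential, of course, but even this crude model shows a fourfold difference in processing between key-first and key-last items.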

Obviously answer option order impacts how test takers respond to items, right?

What is "Disciplinary Arrogance?"

Some disciplines seem more arrogant than others.

By that, I mean that some disciplines seem more willing than other disciplines to take their toolbox and lens and apply them to problems that they were not created to address.

By that, I mean that some disciplines seem less aware than other disciplines of other related disciplines and their toolboxes and lenses.

By that, I mean that some disciplines seem more dismissive than other disciplines of the answers and discussions generated within other disciplines.

Obviously, no discipline is intrinsically arrogant or humble. The tools, lenses and filters of any discipline lack anything like arrogance or humility. To be honest, those are just the wrong traits to apply to a discipline.

But by disciplinary arrogance, I do not merely mean that some people are more arrogant about their discipline than others. I do not refer to individuals. Rather, I think that it is something cultural, something that exists within communities and social groups, is shared and is passed on to future generations.

Perhaps the most arrogant discipline is economics. Economists seemingly think that their toolbox applies to all problems and can generate useful—and perhaps even wise—answers to virtually any real world question. Heck, economists have even named their toolbox (i.e., “econometrics”) to make it easier for others to use.

Well, they actually took a bunch of statistical tools used across many disciplines and redubbed them collectively “econometrics.” Even when those statistical techniques are applied to data that is not economic in nature, economists still call it econometrics—as though they invented the tools.

Obviously, disciplinary humility would be the admission that the tools and lenses of a discipline are not the best tools to analyze a problem or situation. Once again, this is not a trait of the tools, but rather something cultural across the membership of a discipline.

Economics is not the only arrogant discipline. Clearly, arrogant disciplines perpetuate their attitude as novices are acculturated and educated into the discipline. Therefore, it is something that can be addressed or even moderated, were the field to believe it appropriate to do so.

But how likely is that?

The Ambiguity of Item Difficulty

In the world of standardized assessment, item difficulty is empirically measured. That means that it is not a reference to conceptual difficulty of the KSAs (knowledge, skills and/or abilities) that the item draws upon. Nor is it a reference to how advanced those KSAs are thought to be.

Rather, item difficulty is measured. It is the percent of test takers (or field test takers) who answered the item successfully. The math is a little more complicated for polytomously scored items (i.e., items for which test takers can receive partial credit), but the same basic concept holds. The statistic, p, is simply the percent of available points that test takers earn across the population. 
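That calculation can be sketched in a few lines of Python. The scores below are made-up illustrations, not real data:

```python
def p_value(scores, max_points=1):
    """Classical item difficulty: the proportion of available points
    earned across all test takers. Works for dichotomous (0/1) and
    polytomous (partial-credit) items alike."""
    return sum(scores) / (len(scores) * max_points)

# Dichotomous item: 6 of 10 test takers answered correctly.
dichotomous = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
print(p_value(dichotomous))               # 0.6

# Polytomous item worth 4 points: 10 of 20 available points earned.
polytomous = [4, 2, 3, 0, 1]
print(p_value(polytomous, max_points=4))  # 0.5
```

Note that lower p means a harder item, which is why p is sometimes called an "easiness" statistic.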

This makes the calculation of item difficulty easy. However, it makes the meaning of item difficulty rather…ambiguous.

Imagine some chemistry or mathematics item with a low p. That is, a relatively more difficult item. What does that low p tell us?

  • Could it be that the item is drawn from later in the course sequence, so test takers have had less time to practice its notable KSAs and build upon them? Perhaps so late that some test takers’ classes had not covered it yet when they took the test?

  • Could it be that the item is from an early topic, but is a particularly nasty application of the KSAs? That is, an application requiring a greater level of skill or understanding with the KSA(s) to answer successfully, as opposed to a more routine application?

  • Could it be that the item combines such a variety of KSAs that it provides an unusually large number of opportunities to reveal test takers’ potential misunderstandings? That is, different test takers might fall short for a range of different shortcomings in their KSAs?

  • Could it be that the item has speededness issues? That is, the item takes longer to complete successfully than most items, leading many test takers simply—and perhaps wisely—to bail on it in order to use their time more efficiently.

  • Could it be a multiple choice item with a very tempting alternative choice? That is, a distractor that perfectly captures a very common misunderstanding in the targeted KSAs?

  • Could it be a multiple choice item with a different sort of very tempting alternative choice? That is, a distractor that perfectly captures a very common mistake that is not tied to the targeted KSAs?

  • Could it be a multiple choice item with yet another sort of very tempting alternative choice? That is, an unintentionally ambiguous distractor that many test takers read as a correct answer option, even though the test developers did not intend it to be correct?

  • Could it be a multiple choice item with the converse problem? That is, an unintentionally ambiguous key (i.e., intended to be the correct answer option) that many test takers read to be an incorrect answer option, even though the test developers did not intend it to be incorrect?

  • Could it be that the item presents unusual language to many test takers? That is, an item whose instructions are different from how many teachers explain that sort of task, such that many test takers are not quite clear on what is being asked of them?

  • Could it be that the item has unrecognized fairness issues? That is, an item that includes some sort of construct-irrelevant and culturally-based knowledge? For example, use of some language that is well known to urban item developers and test takers, but not to exurban or rural test takers (e.g., bodega, bike path). 

  • Could it be that the item targets KSAs that students often have more trouble learning or mastering? That is, the item’s low p is actually a reflection of the difficulty that students have in learning a particularly tricky or subtle lesson—something that is generally well known by teachers. 

Yes, some of these explanations suggest a poor quality item. Three of them are clearly items that should not be used, because they are bad items. Two others present debatable cases about whether they are bad items. I believe that one of those is a bad item, but the other is a question that the client would need to settle. But the other six explanations are not bad items. Whether they are appropriate for a test is a question of expert judgment that needs to be calibrated against the intentions for the test.

(And none of these explanations are about the different topic of cognitive complexity, though it is often conflated with item difficulty.)

So, you see, measuring item difficulty empirically is not sufficient to understand the item. Like all psychometric tools, it is not capable of providing test takers, students, teachers or policy-makers the kind of information that they need to improve instruction and/or educational outcomes for students. It does not even provide information about the capabilities of test takers (i.e., aid in criterion-based reporting). Rather, it is entirely oriented towards comparing test takers to each other (i.e., to aid in norm-based reporting), with shockingly little reference to the targeted constructs.

The misunderstood relationship between validity and reliability

The foundational psychometric mistake is that psychometricians behave as though—and perhaps believe that—reliability is a sort of fertilizer for validity. That is, in practice the mistaken disciplinary view seems to be that reliability leads to validity. But that has the causal relationship backwards. In fact, validity leads to reliability. But validity is not the only factor that can lead to reliability, and that is where the problems come in. Efforts to increase reliability can be orthogonal to validity, or even come at the expense of validity.
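A toy simulation can illustrate that trade-off. Everything here—the two ability dimensions, the item counts, the noise level—is an assumption of my own, not a claim about any real test: when the construct is genuinely two-dimensional, dropping the items that “misfit” a unidimensional scale raises Cronbach’s alpha (reliability) while lowering the correlation with the full construct (validity).

```python
import random
import statistics

random.seed(42)
N = 20000     # simulated test takers
NOISE = 1.0   # item measurement error (SD)

# Two uncorrelated ability dimensions; the intended construct is their sum.
A = [random.gauss(0, 1) for _ in range(N)]
B = [random.gauss(0, 1) for _ in range(N)]
criterion = [a + b for a, b in zip(A, B)]

def make_item(dim):
    """One noisy item score per test taker, driven by one dimension."""
    return [d + random.gauss(0, NOISE) for d in dim]

items_A = [make_item(A) for _ in range(8)]  # 8 items measure dimension A
items_B = [make_item(B) for _ in range(2)]  # 2 items measure dimension B

def total(items):
    return [sum(vals) for vals in zip(*items)]

def corr(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

def cronbach_alpha(items):
    """Standard alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(items)
    item_var_sum = sum(statistics.pvariance(i) for i in items)
    return k / (k - 1) * (1 - item_var_sum / statistics.pvariance(total(items)))

full = items_A + items_B
alpha_full = cronbach_alpha(full)
alpha_trimmed = cronbach_alpha(items_A)       # drop the "misfitting" B items
validity_full = corr(total(full), criterion)
validity_trimmed = corr(total(items_A), criterion)

print(f"alpha:    full={alpha_full:.3f}  trimmed={alpha_trimmed:.3f}")
print(f"validity: full={validity_full:.3f}  trimmed={validity_trimmed:.3f}")
# Trimming raises alpha but lowers the correlation with the true construct.
```

The trimmed test is more internally consistent precisely because it has stopped measuring part of the construct it was supposed to measure.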
