The Original Sin of Large Scale Educational Assessment

The Standards for Educational and Psychological Testing explain five “sources of validity evidence” on pages 13-21.

  • Evidence Based on Test Content

  • Evidence Based on Response Processes

  • Evidence Based on Internal Structure

  • Evidence Based on Relations to Other Variables

  • Evidence for Validity and Consequences of Testing 

Only one of these is really about even moderately sophisticated psychometrics: Evidence Based on Internal Structure. The others are either content based or rely on other sorts of statistical techniques. But evidence based on internal structure gets at some real issues in psychometrics. It is easy to understand, as it has the shortest explanation of the five potential sources of validity evidence. For example, the first of its three paragraphs says:

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity (p. 16).

And yet, the practice of developing, administering and reporting large scale standardized educational assessment seems to have mostly abandoned this form of validity evidence—the only form that really gets at psychometric issues. 

Straightforward examination of domain models (e.g., state learning standards) immediately reveals that these tests are supposed to measure multi-dimensional constructs. Those who know the constructs and content areas best are quite clear that these constructs (i.e., content areas) are multidimensional, with different students doing better in some areas and worse in others. They require an array of different sorts of lessons and ought to be measured with an array of different sorts of questions. 

I was taught that this kind of psychometric analysis is really about factor analysis of some sort: which items tend to load on which factors—dimensions—followed by qualitative, content-based analysis to confirm that this is as it should be. Heck, the basic question of whether the hypothesized dimensionality of the construct is reflected in the empirical dimensionality of the instrument…well, I was taught that that is really important. And The Standards seems to say that, too.
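To make that concrete, here is a minimal sketch of that kind of internal-structure check, using simulated data and continuous item scores for simplicity (real item-level work would typically use tetrachoric correlations or IRT-based factor models; the item counts, loadings and noise level here are all made up for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)

# Simulate 2,000 test takers with two distinct proficiencies.
n = 2000
traits = rng.normal(size=(n, 2))

# Ten items: the first five target trait 1, the last five target trait 2.
loadings = np.zeros((10, 2))
loadings[:5, 0] = 0.8
loadings[5:, 1] = 0.8
items = traits @ loadings.T + rng.normal(0, 0.6, size=(n, 10))

# Fit a two-factor model and inspect the recovered loadings.
fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(np.round(fa.components_.T, 2))
# If items 0-4 and 5-9 load on separate factors, the empirical structure
# matches the hypothesized two-dimensional construct.
```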

But instead of ensuring that the dimensionality of the instrument matches the dimensionality of the domain model, the dominant mode in large scale educational assessment has an almost knee-jerk reliance on unidimensional models. Heck, items that fail to conform to this demand are discarded, as model fit statistics are the ultimate determinant of whether they can be included on a test (form). Such statistics are used to ensure that the dimensionality of the instruments does not match that of the construct. 

This use of such statistics combines with the use of unidimensional models to ensure that tests are not valid, by design. It ensures that domain models will be reread, reinterpreted and selected from only insofar as they can support the psychometric model. The tail wags the dog.

There are many issues with large scale assessment that cause educators, learners, parents and the public to view them as “the enemy,” as Steve Sireci observed in his 2000 NCME Presidential Address. But if I had to pick the single most important one, this would be it. Multiple choice items are problematic, but it quite often is possible to write good multiple choice items that i) reflect the content of the domain model, ii) prompt appropriate response processes, iii) combine for an internal structure that resembles that of the domain model, iv) combine to have appropriate relationships to other variables, and v) support appropriate inferences and consequences. But none of those are possible while insisting that items and tests are not allowed to match the structure of the domain model. This is not simply about ignoring the domain model, as some sort of neglect. Rather, this is active hostility that affirmatively bars using it as the primary reference for test development.

Looking for DIF or other violations of invariance that suggest fairness issues is not enough, so long as the structure of the domain model itself is barred from properly influencing test construction, as The Standards say it should.

To state this more plainly, this practice sets psychometric considerations as the main obstacle to developing valid tests—or tests that can be put to any valid use or purpose.


Difficulty and Rigor

“Difficult” is a problematic word, in the world of teaching, learning and assessment. It refers to many related ideas, but because they all use the same word people often conflate them. (I just learned this year that that is called a jingle fallacy.)

As a former high school teacher, I can think of a whole bunch of different meanings that I might have intended from time to time.

  • Something is difficult if it takes a long time to do.

  • Something is difficult if it is unlikely to be successful.

  • Something is difficult if most people—or at least very many people—are likely to fail at it.

  • Something is difficult if I, as a teacher, have to spend more time teaching it in order for my students to develop proficiency.

  • Something is difficult if it marks such a change from earlier lessons that I need to work hard to teach my students to adopt a different mindset for it than they’ve had before.

  • Something is difficult if it is easily mistaken for something else, and therefore likely to be attempted with the wrong tools.

  • Something is difficult if a person simply does not have any experience with it.

  • Something is difficult if someone has never had to do it themself before.

  • Something is difficult if few people will ever be able to do it.

  • Something is difficult if the path to gaining proficiency is quite long.

  • Something is difficult if precursors are not taken care of.

This incomplete list contains many different ideas, some of which overlap, some of which address radically different aspects of difficulty than others. Some of them might be viewed as contributors to difficulty and some as different conceptions of difficulty.

Large scale assessment (LSA) has a very particular idea of difficulty. In this context, difficulty is measured empirically. It has nothing to do with teaching effort or learning paths. Rather, it is simply the share of test takers who responded to an item correctly. Concepts and lessons do not have difficulty, just individual items. Because seemingly minor alterations to an item can radically alter how many test takers answer successfully, difficulty must be measured through field testing and monitored through operational use of an item.
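In code, that definition is almost trivially simple. A minimal sketch with a made-up scored-response matrix (rows are test takers, columns are items; 1 = correct):

```python
import numpy as np

# Made-up scored responses for five test takers on four items.
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
])

# Empirical difficulty ("p-value") is just the share answering each item correctly.
p_values = responses.mean(axis=0)
print(p_values)  # [0.8 0.4 0.4 0.8] -- lower p means a harder item
```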

In a teaching and learning context, however, this empirical difficulty is not actually a fixed attribute of an item. It is the result of the efforts of teachers and students. When teachers spend more time on some difficult content or lesson—or perhaps come up with a great new way to teach it—students can become more successful learners and then more successful test takers.

Ross Markle explains that educational (and other) interventions can undermine the psychometric properties of tests (and items), including when those interventions are prompted by those tests. An item might be quite difficult (empirically) one year, but because teachers respond to that with new efforts, test takers might be much more successful the next year. Other items like it might be empirically much more difficult in later years, perhaps to the surprise of test developers.

Dan Koretz has long pointed to how teachers respond to test items by altering their teaching to better prepare students for precisely those approaches to assessing content. Alternatively, one reasonable use of LSAs is to evaluate whether curricula support the development of targeted proficiencies. Thus, this kind of feedback loop can undermine teaching and learning, and it can support teaching and learning.

(Of course, all of this violates psychometric assumptions of unidimensionality, but that’s neither here nor there.)

From time to time, there is great talk about needing more rigor in school and especially in assessment. That’s the word that people use, “rigor.” But we think that people really mean difficulty. And we think they mean empirical difficulty. They want harder tests that produce more failing results. They appear to mean that a better test is one that produces lower scores.

We think that that is garbage. We think that a better test is one that better reflects the actual proficiencies of test takers with elements of the domain model—such as the state learning standards for that subject and grade.

And frankly, we do not think that rigor is about empirical difficulty or conceptual difficulty. We do not approve of using “rigor” as a euphemism for “hard.” Rather, we think that rigorousness is something like:

  • extremely thorough, exhaustive, or accurate;

  • strictly applied or adhered to;

  • severely exact or accurate; precise.

Perhaps large scale assessments should be rigorous. They certainly should be accurate. We would prefer that they be exhaustive, but do not see that they are given enough testing time and opportunities to do that. But those seem like reasonable goals.

Their empirical difficulty should be driven by the nature of the content they are supposed to reflect and the teaching and learning of that content. It should not be a goal of the vindictive or be targeted by psychometric considerations that are not deeply based in issues of content, of teaching and of learning.

Certainly, however, test development should be rigorous. Our test development processes should be demanding, and our own professional standards should be strictly adhered to. That is where we would—and do—apply the word “rigor.”


An Economist at Thanksgiving

I had an economist at my Thanksgiving table last week.

That’s not a joke or punchline. That’s just a fact. She doesn’t seem like an economist. She is remarkably charming, warm and personable. So, I usually don’t think of her as being an economist.

What do I mean by that, by “being an economist”? Well, in my experience I have found the field of economics to suffer from the greatest degree of disciplinary arrogance of any field. Economists appear to be trained and acculturated into the belief that their toolbox is so outstanding that there is no need to learn—or even respect—the substantive expertise of those in other disciplines. Disciplinary Arrogance.

Now, I am not what I would consider highly expert in school dress code policies. I have been following the issue for 30-40 years. I occasionally read professional research articles on the topic. I follow news links on new developments or examples. But I have not done my own original research or any sort of exhaustive review of the literature—though I sure have read more than one literature review from others over the decades. School dress codes and uniforms are one of those areas of policy and policy implementation that run right into my own beliefs on adolescent identity and social development. So, this topic brings a few professional interests together, even though I do not actually focus my own work on it.

But the economist has a different relationship to this topic. For some second-hand personal reason, she did (or trusted) a little bit of internet research on the topic and was highly confident that she understood the subtleties and complexities of how the law interacted with a particular charter school’s uniform policy implementation. She thought she understood what the New York Board of Regents had required of public schools.

It was not a public school. It was a charter school, but she did not understand that charter schools are not public schools. (When I pointed out that they are simply privatized provision of traditionally publicly delivered services, she immediately got the point. She’s an economist, so she was already familiar with that dynamic, though she did not recognize it herself.) It was not the Board of Regents, but was instead a new policy from the city’s Department of Education prompted by a New York City Council bill.

Most importantly, she did not understand the substance of the DOE guidelines. She did not know the history of the issues, the problems that the new guidelines were specifically designed to address or the significance of the particular language within the guidelines (e.g., “distracting”). She thought that she understood it all, having read the guidelines and applied her own knowledge and thinking to the document. She thought that her reading was accurate and her conclusions correct. She insisted on it. She refused to accept that she could have been wrong.

But expertise matters. Which means that intellectual and disciplinary humility matter. While playing in someone else’s backyard—an expression I think I learned from Andrew Gelman—can be fun, it is important to listen and to be mindful of the limits of one’s own expertise. It is important to learn from those with greater expertise, usually by asking questions. This requires acknowledging the expertise of others.

Which, unfortunately, is something that economists generally are not very good at.


The Overarching Goal of Schooling

For my first master’s degree, I had to do a master’s thesis.

I wrote about the importance of a school having a purpose, or what I called “a school-wide working philosophy.” Because I graduated from college certified to teach (i.e., back in the 20th century), this thesis came ten years after I began my teaching career. I had worked in enough schools to have seen some stuff.

I was not particular about the purpose. Rather, I was concerned that when different teachers and different programs in a school are aimed at different purposes, they undermine each other. For example, athletics can undermine academic classes when coaches pressure teachers to raise a student’s grade so that they meet grade requirements for eligibility to play. Traveling debate teams can undermine academic classes, too; when any extra-curricular activity pulls students out of class for a road trip of any sort, students miss lessons for their academic classes.

Test scores and deep content lessons are not generally aligned. Core academic lessons and developing an outstanding college application are not necessarily aligned. Social development is something else, entirely.

Today, decades later, I still believe in program alignment. I still believe that schools should be clear about their priorities and what they are trying to do for students. I still believe that the hodgepodge of different programs with different allegiances among students and among educators is dangerous.

But I think that today I am more accepting of the existence of multiple goals. This makes the alignment problem even more challenging. This makes leaders’ roles even more important, as they try to build and manage a school or system that supports multiple goals without them undermining each other. The decisions are tougher, not easier, this way. And leaders still need to say “No” sometimes.

Nonetheless, I still believe in an ultimate and overarching goal for our schools. I do not believe that schools exist primarily to deliver some benefit(s) to individual students. That is not a good enough reason to require all children to attend or to tax everyone to pay for this enormously expensive endeavor. Rather, schools exist to serve communities.

Therefore, whatever programs or purposes schools might have, they should serve the community. Is students’ social development important? Surely. Preparation for citizenship? What could be more important than that? Preparation for economic contributions? Yes, that is important, too.

This is my lens. I think this is always my lens, the ultimate goal of schools and schooling. The overarching goal. Not merely a justification others try to hang things from, but rather a value with which to judge efforts in and around schooling. How can this school and its program support the communities it serves?

Therefore, schools must be under the control of communities. Strong influence from the local community. Influence from the regional (e.g., state) community. And so long as we are such an interconnected country, influence from the national level, as well. They must be communal affairs, and not perverted into serving private individual interests.

We already have private schools that are designed to serve individual interests. No, they are not good for communities. They are not good for democracy. They are not good for a pluralistic nation, region or community. Therefore, it is even more important that our public schools be preserved and supported. Therefore, it is vital that we protect them from privatizing interests.


The Road from Validity (Part II): Cementing the Replacement of Validity

Last month, I explained how large scale human scoring departs from carefully constructed rubrics and procedures and replaces them with something different. The desire for reliability, and procedures that favor reliable scorers (i.e., those most likely to agree with other raters), serve to push out those who use the valid procedures and favor those who use the faster procedures.

Of course, we do not live in an age of large scale human scoring of test taker work product. We have replaced most human scoring with automated scoring. There are algorithms and artificial intelligence engines of various sorts that do the scoring, instead. They are trained on examples of human scoring, but the bulk of the work is done by automated scoring.

How do we know whether the automated scoring engines are any good? Well, their creators, backers and proponents proclaim again and again that they do what human scorers do, only better. Not just faster and cheaper, but better.

What does better mean? Well, they use the same idea of reliability to judge the automated engines as they do to judge human scorers. They make sure that the automated scoring engines have high agreement rates with the pool of human scorers on some common pool of essays. That sounds fair, right? They use the same procedures used to judge human scorers and decide who is invited back to judge the machines.
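In practice, that check might look something like this—a minimal sketch with made-up scores, using scikit-learn’s cohen_kappa_score; operational programs variously report exact agreement, adjacent agreement, or (quadratic weighted) kappa:

```python
from sklearn.metrics import cohen_kappa_score

# Made-up ratings on a shared pool of ten essays, on a 1-4 rubric.
human =  [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
engine = [3, 2, 4, 2, 1, 2, 3, 4, 3, 3]

# The engine is judged by how often it matches the human ratings.
exact = sum(h == e for h, e in zip(human, engine)) / len(human)
print(f"exact agreement: {exact:.2f}")                     # 0.80
print(f"Cohen's kappa:   {cohen_kappa_score(human, engine):.2f}")
```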

This means that the same dynamic that replaced validity with reliability—that replaced the valid scoring procedure with the faster scoring procedure—is used to tune the algorithm.

No, automated scoring is not better at using the valid procedure—not faster, cheaper or more diligent. No, it means that automated scoring is better at using the faster procedure.

Of course, once the faster procedure is coded into the algorithm, it is harder to question. This is yet another form of algorithmic bias. Algorithms trained on biased data will perpetuate those biases in perpetuity, rather than highlight them. In this case, it just repeats the biases that make up the deviations from the valid procedure. Whatever was harder, slower or more nuanced in the valid procedure is left out.

No, machines do not make mistakes. They do what they are told to do. The human scorers who depart from the valid (and slower) scoring procedures are making mistakes. But the machines are simply doing what they are told to do—trying to match the results of the human scorers.

And their developers and proponents brag on that, as though doing what they are told to do is necessarily doing the right thing. They fail to audit their training data (i.e., the results of human scoring) for validity, trusting in their beloved reliability. So, their algorithms cement the biases and departures from validity and hide them behind claims of reliability—as though reliability is the same thing as validity.

Not intentionally. Perhaps not quite inevitably. But quite consistently. Quite reliably.


The Road from Validity (Part I): How Reliability Replaces Validity


Imagine that you have some complex multidimensional trait that you want to test for. 

A. You get a bunch of experts and ask them to rate the recorded performances of test takers…only they don’t agree on scores. They rate the performances inconsistently because they do not agree on how to weight the different contributing factors and dimensions. They are all experts, so they are not wrong, but you judge the inconsistency a problem.

B. So, you train them. You provide them rubrics and exemplars. You tell them how they should be scoring. You try to standardize their judgments. 

C. Then, you want to remove the outlier scorers and keep the ones who are consistent with each other. You want comparable scores, so you can explain what the scores actually mean, what they refer to. You don’t want that mess of inconsistency, which you believe does not actually help anyone. 

To ensure that test takers are rated fairly and outlier scorers are caught and removed, you have every performance rated by two different scorers, selected at random. If they agree, that’s good. If they disagree, that’s bad. Over the course of a day or a week, each scorer rates a whole bunch of performances, so you can check their overall rate of agreeing with their randomly selected scoring peers. The ones who tend to agree more are the ones you keep, and the ones who seem to march to their own beat more often are not invited back.


D. You are actually running a business or have some sort of budget. You want the scoring done as quickly as practicable. And the scorers? This job is rather repetitive and boring, so they don’t focus as well on each individual performance when they are rating so many of them. These are two different pressures for speed, both top-down and bottom-up. Some scorers go faster, taking little shortcuts and using their own heuristics in place of the full formal procedures laid out in step B.

Is that real scoring expertise at work? Is that developing skill? Or is it replacement of the official procedures with something else a bit faster?

E. Let’s look more closely at what happens to the agreement rates. We’ll use easy round numbers to make the analysis simpler. Imagine that half of the scorers take that faster route, and half go by the officially sanctioned formal route. And imagine that the faster route is twice as fast. That means that the faster group is going to score twice as many performances as the slower group. That means that everyone’s randomly selected scoring partner is twice as likely to be a fast scorer as a formal scorer.

In our little simplified thought experiment, these two groups are using two different methods of scoring, and they sometimes will differ in the scores they report. Whether I am a fast scorer or a formal scorer, 2/3 of the time I will be randomly paired with a fast scorer. I will have better agreement rates if I use the fast scoring method than the formal scoring method. The fast scorers will have higher agreement rates than the slow scorers.
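A quick simulation bears out the arithmetic. This is a minimal sketch under assumed numbers: scorers using the same method always agree (a simplification), and the two methods agree with each other 70% of the time:

```python
import random

random.seed(0)

CROSS_METHOD_AGREEMENT = 0.70  # assumed rate at which the two methods coincide
N_PAIRINGS = 100_000

def agreement_rate(my_method: str) -> float:
    """Average agreement a scorer sees when partners are drawn by throughput."""
    agreements = 0
    for _ in range(N_PAIRINGS):
        # Fast scorers produce twice the volume, so 2/3 of random partners are fast.
        partner = "fast" if random.random() < 2 / 3 else "formal"
        if partner == my_method:
            agreements += 1  # same method, same score (by assumption)
        elif random.random() < CROSS_METHOD_AGREEMENT:
            agreements += 1  # different methods happen to coincide
    return agreements / N_PAIRINGS

print(f"fast scorer agreement:   {agreement_rate('fast'):.3f}")    # ~0.90
print(f"formal scorer agreement: {agreement_rate('formal'):.3f}")  # ~0.80
```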

The fast scorers will be retained and the formal scorers will be replaced, and some of the replacements will opt for the faster method. This will further lower the formal scorers’ agreement rates and further increase the fast scorers’ agreement rates.

*********************

What has happened to your construct? What are the test takers’ performances being judged against? Would you even notice the shift?

Would you notice that the use of reliability to evaluate scorers drove a shift from the formal, documented scoring procedures that were designed to best evaluate the construct to some—perhaps obvious—shortcuts that do not consider all of the dimensions and subtlety of the construct?


The Unacknowledged Tyranny of the Platinum Standard

I just got back from an educational research conference, and as is my wont, I had a lot of conversations about assessment and educational measurement.

On the morning of the last day of the conference, as people were saying their goodbyes, I found myself in conversation with a brilliant young psychometrician still on her first job in industry. I was pushing her to consider examining the application of some sort of multi-dimensional psychometric model when she got a chance to do her next little research project. She was concerned that that might mark her as being a little weird, as the industry is so heavily invested in unidimensional psychometric models. She pulled in Yu Bao, a professor in James Madison University’s Assessment & Measurement program, who was walking by. Yu agreed with me that there’s a lot of room for a psychometrician to make their name with multi-dimensional models.

I went on one of my typical rants about the mismatch between unidimensional psychometric models and multi-dimensional domain models and the platinum standard. That is, the way that psychometricians bring model fit statistics to data review meetings and strongly suggest that items with poor model fit—poorly fit to the misplaced unidimensional model—be removed. (They do this with item difficulty statistics, too, but that is not as bad for validity claims as this use of model fit statistics from inappropriate models.)

This young psychometrician pushed back, however. She said that psychometrics uses unidimensional models because they fit the data better.

But that’s not true. That’s not true in practice and that is not true at research conferences. Just the previous day, a colleague of mine told me about a session that he attended—and walked out of. There, a young psychometrician was explaining the use of factor modeling techniques to something something something—I didn’t attend that session, so I do not know what he was trying to do. He showed that item 31 did not fit his model, so he removed it. He did not remove it because it was not well aligned to the assessment target or larger macro-construct. He never even looked at the actual item itself. Rather, he just removed the item because it did not fit the psychometric model he was using.

No consideration for the construct’s theorized model. No consideration of the formal domain model. Only the psychometric model.

My colleague was walking by, so I pulled him into the conversation. Yu agreed that this happens sometimes, despite what the brilliant young psychometrician had been taught and expected to see.

My colleague and I know that this happens quite a bit. Psychometricians come with their opaque techniques and intimidatingly precise numbers. Few people outside of psychometrics have the confidence to push back against people armed with something they do not understand, and that precision—and all of its decimal places—is so easily mistaken for accuracy.

The platinum standard is powerful. It shapes our assessments, and not for the better. It leads to the removal of possibly the best aligned items, simply because they do not accord with the demands of simpler psychometric tools and their rather unrealistic assumptions. The platinum standard forces those inappropriate assumptions on the entire field, requiring those who actually focus on the domain model and content alignment to simply accept the demands of psychometrics.


Content Development Professionals Require Different Expertise than Teachers

Because assessment is an important part of teaching, it is not surprising that content development professionals (CDPs) for large scale assessment (e.g., standardized testing)—the professionals who develop and refine the contents of tests—require many of the same skills, knowledge and expertise as teachers. However, CDPs also need other skills and knowledge to work at a high level.

Unlike teachers, CDPs do not have to worry about classroom management or lesson plan construction. But they do need a teacher’s understanding of the content area and how to think about content and learning paths. Of course, both teachers and CDPs need to understand the cognition of others without unwittingly projecting their own thinking on them.

However, CDPs need to think about these things a bit differently than teachers. Large scale assessment does not have as many opportunities to assess students as classroom practice, so it must do so much more efficiently. Teachers can triangulate lots of different information from and about students to figure out whether they understand something, but large scale assessment usually depends on a single assessment to make that inference. Therefore, CDPs need a much more precise view of evidence than teachers do. They need to be able to recognize the ambiguity of evidence so that they can create test items that elicit more definitive evidence of a test taker’s level of proficiency.

While teachers often focus on how to integrate the knowledge, skills and abilities found in various learning standards into larger lessons and activities, CDPs need to understand how to isolate them while still preserving some amount of authenticity. They need to be particularly mindful of the kinds of mistakes that learners make and how they relate to particular learning goals—recognizing their connection to the targeted cognition of an item.  

Like teachers, CDPs need to understand how others think—often others very different than themselves. Teachers have their students in front of them, and can learn more about them over time. CDPs have to imagine test takers, rather than being reminded of them every day by their presence. Moreover, the range of diversity and dimensions of diversity are vastly larger with large scale assessment than a single teacher in a single school must account for. We call this radical empathy because of the amount of variation in background, experience and perspective that CDPs must be able to anticipate. 

CDPs also require technical knowledge and skills that teachers do not. CDPs need to know the differences between different item types, how they work, and which are most appropriate to elicit evidence of different sorts of cognition. They need to be able to recognize problems in a multiple choice item and know how to make it better elicit evidence of the targeted cognition for the range of typical test takers. It is incredibly difficult to create high quality multiple choice items that produce high quality evidence, which makes it all the more important to take those challenges seriously.

They need to understand the workflows, contributors and collaborations that comprise the test development process. Moreover, they need to have the ability to push back against the various pressures to alter items in ways that compromise their ability to elicit evidence of that targeted cognition for the range of typical test takers—or even omit them entirely from an assessment. Of course, all of this requires understanding the values and thinking of the many different disciplines that contribute to these collaborations.

I would never suggest that CDP work is more difficult or complex than teacher work; clearly it is not. Working with children—of any age—and being sensitive to their needs is enormously challenging and complex work, made more so by the official and unofficial learning goals. All of those challenges are magnified exponentially by the reality of how many children are there at the same time.

However, the work of developing the contents of standardized tests is itself complex and difficult, mostly in ways unappreciated by the public—and even by others involved in large scale assessment. It leans on many areas of skill and knowledge that overlap with teachers’ expertise, but it has different goals and constraints. Therefore, it also requires different expertise—including, but not limited to—expertise in the content area.


Content Development Work Requires Far More than Just Content Expertise

There is a cynical and incredibly foolish expression, “Those who can, do; those who can’t, teach.” Yes, it is grounded in the idea that many teachers are not the deepest content experts. However, it is foolish for two reasons. First, it entirely misses the fact that good teaching requires its own set of skills—skills that mere content experts usually lack. Second, and perhaps less obviously, it misses the fact that teachers must have a different relationship to the content than mere practitioners—even those at the highest level of expertise who practice with the most nuance, skill and wisdom.

Teaching requires thinking about the content, holding it at arm’s length, rather than just using it. Some (e.g., teachers) call this meta-cognition, thinking about thinking. Being able to do something does not require consciously understanding what it is you are doing or being able to communicate it to others. In fact, that kind of thinking can get in the way of fluid skillful practice. It does not require understanding how others might do the skill. Teachers, however, have to understand the kinds of mistakes that learners make, and the different learning paths towards proficiency.

Content development work for large scale assessment (e.g., standardized testing)—the work of crafting and refining the contents of tests—requires many of the same skills as teaching. It requires thinking about the content. Like teachers, content development professionals (CDPs) need to understand how others understand the content, and the ways in which they might misunderstand or misapply it. They need to be able to recognize their own thinking and learning path, but not be so self-centered as to assume that it is the only learning path. Like good teachers, they need the empathy to imagine the cognitive paths of others—including those with vastly different backgrounds and experiences. 

Yes, and like teachers, CDPs need content expertise. And like teachers, they need far deeper content expertise than most people realize. They need to understand how the content fits together and how it is applied in practice. They rarely have the fluidity of a practitioner’s mastery at the highest level of professional practice, but they have deep understanding of content, nonetheless. 

And they also need many of the expertises of teachers, in addition to their own particular additional skills, knowledge and abilities.


Has IRT Destroyed Old School Validity?

When I first learned about the measurement idea of validity, I was taught that it is about whether the measure actually measures the construct. I was taught that validity pairs with reliability, which is about how consistent the measure is. Reliability is like the margin of error from one measurement to the next, but validity is whether you’re aiming at the right target. I have had this definition in my head for…decades? I think I first learned about this in a psychology class in the 1980s.

When I came to the field of educational measurement this century, I found a somewhat different definition. Since the 1999 Standards for Educational and Psychological Testing (AERA, APA & NCME), in educational measurement validity is about whether there is evidence to support the use of a test or measurement for a particular purpose. We all stress that validity is no longer a property of the test itself, but rather is a property of the particular test use. And there are different kinds of evidence that can contribute to this idea of test validity.

That shift to attention on particular test uses is really important. When tests are repurposed, they might no longer be appropriate. For example, a very good test of 4th grade mathematics is simply not a good test of 6th grade mathematics. It is not that the test has changed, but rather that its use has changed. So, the old validity evidence for the old use is no longer germane. 

I buy that. I really do. But I still have in my head the issue of the basic inference. That is, totally apart from test use, does the test actually measure what it claims to measure? Are the inferences we make about test takers…valid? I think that that still matters.

In fact, I think that whether the tests measure what they are supposed to measure is the real point. I think that that old school idea of validity as simply the question of whether the test measures what it is supposed to measure is critically important. If it does, then appropriate uses are kinda obvious. And inappropriate uses are also kinda obvious.

So why the shift from the 1985 Standards to the 1999 Standards?

I have a theory that is probably incorrect. But it’s in my head.

For decades, the statistics behind big standardized tests have been based on something called IRT (item response theory), and before that they were based on CTT (classical test theory). Each of these generally reports a single score that is useful for sorting and ranking test takers. No matter how many different elements the test is supposed to measure—like the different standards in a domain model—they each report a single unified score. However, for them to work reliably, test developers remove potential test items that seem like they might be measuring something a little different than the other items. So, the better that each item does at measuring its targeted standard, the less likely that item is to be included. The more an item instead kinda measures some muddled middle idea of the construct, the more likely it is to be selected. Psychometricians call that “model fit,” and the model is usually unidimensional IRT or CTT.
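Here is a minimal sketch of that filtering dynamic, using simulated data and a simple corrected item-total correlation as a stand-in for full IRT fit statistics (an intentional simplification; the ability structure, item counts and response model are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two distinct proficiencies; 20 items tap dimension A, 5 tap dimension B.
n_students, n_a, n_b = 5000, 20, 5
theta = rng.normal(size=(n_students, 2))

def simulate(ability, n_items):
    """Dichotomous responses from a simple one-parameter logistic model."""
    difficulty = rng.normal(size=n_items)
    p = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
    return (rng.random((len(ability), n_items)) < p).astype(float)

responses = np.hstack([simulate(theta[:, 0], n_a), simulate(theta[:, 1], n_b)])
total = responses.sum(axis=1)

# Corrected item-total correlation: each item against the total excluding itself.
for i in range(n_a + n_b):
    r = np.corrcoef(responses[:, i], total - responses[:, i])[0, 1]
    dim = "A" if i < n_a else "B"
    print(f"item {i:2d} (dim {dim}): r = {r:.2f}")
# The dimension-B items come back with much lower correlations, so a
# unidimensional screen flags them first -- even though they are the only
# items measuring dimension B at all.
```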

When there is a conflict between a multi-dimensional domain model (e.g., the different knowledge, skills and abilities that go into a set of standards) and a unidimensional psychometric model, modern educational measurement favors the unidimensional model—throwing aside items that better fit the domain model than the psychometric model.

As a content person, I have never been able to figure out what that mushy middle means. On a 4th grade math test, it’s some vague idea of math ability…but it’s not clear which aspects of math ability factor in and which do not. It might include ability to learn math. But how much? It might include computational accuracy. But how much? It might include problem solving ability, but how much? Or even reading ability! Because model fit statistics lead to the removal of really strongly targeted items (i.e., as opposed to items that lean towards the muddled middle), I don’t think we could ever know.

These techniques produce a seemingly definitive ranking of test takers with seemingly definitive quantitative scores—often to many decimal places. But it is never clear what they are ranked on. Something about math…but what? They most definitely are not a thoughtfully weighted composite score when IRT is combined with item selection and model fit statistics.

Which takes me back to the question of old school validity vs. new school educational measurement test validity. Was the change necessary simply because we never know what IRT is scoring students on, from a content perspective? That is, IRT results are not interpretable through the lens of the construct, so we no longer focus on the inference?

That’s what I am thinking about, these days.


Are we measuring the right construct?


Imagine that you are in a kitchen and need to measure the volume of some odd solid object, or the difference in volumes between two odd solid objects. But the only real measuring tools are scales (i.e., a kitchen scale and a bathroom scale) and any number of household tape measures, rulers and yard/meter sticks. And the internet is down.

* One approach might be to simply take the mass of the object(s) and figure that most things have a density of around 1 g/cm³, and go with that. If you need the difference, take the difference.

* Another approach might be to do that Archimedes thing and try displacement. Fill up a cup or larger container to the brim with water, drop the object in the cup and catch all the water that the new object forces out of the cup. That would take a saucer (or serving platter) under the vessel to catch the water. Measure the mass of that saucer (or serving platter) empty and with the water. Eureka! The difference is the volume, so long as you convert the units, right? So clever, that Archimedes.

* The third, and hardest, approach would be very much like the second approach, but it departs from the Archimedes version, because these objects are not gold crowns. You’d need to push the object down into the water, making sure that it is entirely submerged—but without putting anything else in the water. Either push it JUST under the water, or use some very very fine tools to hold it down further. Again, calculate the mass of the displaced water and convert the units. That’s the volume of the object; just subtract the lesser volume from the greater if comparing two objects. (The unit conversion is sketched just below.)
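The conversion itself is one line of arithmetic. A small sketch with hypothetical kitchen-scale readings, assuming water’s density is 1.0 g/mL (true to within a fraction of a percent at room temperature):

```python
# Volume by displacement: mass of displaced water -> volume of the object.
WATER_DENSITY_G_PER_ML = 1.0  # assumed; close enough for kitchen work

def volume_ml(saucer_empty_g: float, saucer_with_water_g: float) -> float:
    """Volume of a fully submerged object, from the mass of the caught water."""
    displaced_water_g = saucer_with_water_g - saucer_empty_g
    return displaced_water_g / WATER_DENSITY_G_PER_ML

# Hypothetical readings: saucer alone, then saucer plus the caught water.
print(volume_ml(210.0, 342.5))  # -> 132.5 mL, i.e., 132.5 cm^3
```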

The third approach is way more clever than the first two, and is the only one that will actually give you volume. The first approach approximates volumes, but will not work for objects that easily float or sink—signaling a density significantly different than water’s. The first approach just gives you mass. The second approach will work for denser objects, which do entirely submerge in the water, but not for objects that float (i.e., are not entirely submerged). For the former, yes volume. But for the latter it just gives mass again. Not actually as clever as we thought. 

(Archimedes’s experiment was a bit different, and he had a whole bunch of spare gold lying around. Neither you nor I have that available for our work.)

I have no doubt that there are many people who think that psychometrics is analogous to the third approach. That it really is clever enough to take the products of limited tools to measure difficult constructs. But what I see is a dependence on limited tools that simply measure something different than the intended construct. And, no, the analysis is not so clever as to successfully convert the results to the intended construct. Disturbingly, it is not that adequate tools are not available; rather, it’s the insistence on using unidimensional psychometric models and filters to measure multi-dimensional constructs. There are other models, they just are not favored. Perhaps they are not as easy to use. Perhaps they don’t have the established place in curricula and/or practice. Perhaps it is simply that if we’ve always tended to use a hammer, we tend to redefine problems into problems that can be solved with a hammer.

But the charge of testing is to measure the intended construct, not some other construct that our favored tools are better at measuring. 



The Worst Reasons to Reject Change?

Ages ago, when I was in grad school, I learned from Susan Moore Johnson that many people incorrectly cite union contracts as a reason why something must be done, why a practice cannot be altered or an innovation cannot be picked up. As she explained it, people do not actually read the contract, so rumors about what is in it are often even more powerful than the contract itself. This was eye-opening to me at the time, and it has stuck with me. But I think that it was simply too polite an interpretation. Yes, that is often the case. But I think that sometimes—perhaps even more often—it is a willful ignorance. Some people do not care what is in the contract, and are simply invested in arguing against change—regardless of whether their arguments are made in good faith.

Whether the argument that something is in the contract is made in good faith is itself a contentious question. So, we can put that aside. Regardless of whether it is made in good faith, it is often an erroneous argument used to push back against those calling for innovation or change.

I have come across this exact same tactic in other contexts as well. 

* About 10 or 15 years ago I was trying to report a bug to Apple in some piece of their software. The level II support specialist I was speaking with went up to a level III support specialist and came back to me with an excuse. The way I was using the software, he explained, violated the end user license agreement. It wasn’t a bug, you see, it was a misuse. But I knew that couldn’t be the case, so I opened up the very long end user license agreement while on the phone with him and went through it, looking for anything relevant to his point. Of course, it wasn’t there. This was a moral victory, as he had to admit that his superior’s excuse was untrue. (I do not know that it got the bug fixed any faster. I switched to a third party application and have not felt a need to go back to Apple’s app for that use.)

* Just last month, a regional chain with ~75 stores opened a brand new supermarket near me—now my nearest supermarket, just 14 minutes away. Unfortunately, there are a handful of operations mistakes that make shopping there just a little more annoying than it needs to be, and they hit me every time I go there. I have mentioned them to the “customer experience manager,” and last week I saw a team of muckety mucks from headquarters going through the store to make a list of lessons and things that might be fixed. I took the opportunity to mention a couple of my concerns. As I was checking out, one of them came to me to thank me for my feedback. He said he was the head of store design for the chain. I took the opportunity to share another concern, one that would actually take a little—just a little—money to fix. While I was talking with him, some assistant manager (from another store) came up to defend the chain’s honor. He started making excuses that I knew weren’t true. Eventually, he said that they couldn’t fix the problem because of the ADA. For me, that was too much.

I replied to him, “You mean that the Americans with Disabilities Act, signed into law in 1990 by President George Herbert Walker Bush (and perhaps amended since then), has a provision in it that says what side of the self-checkout machine the groceries go on? I bet you $100 right now that that is not true.” It actually wasn’t the first time that day I had used that line about making a $100 bet. Just earlier, when I was talking to another assistant manager in the store, he said that 80% of people who go to the store end up in the refrigerated section, and I knew that could not be true. (The head of store design confirmed that it wasn’t at all close to 80%, and they didn’t want it to be.) This wasn’t even the first time that they mentioned the ADA, as that assistant manager had also tried to invoke it to push me off another point I had tried to share.

I have observed that the federally required peer review process for many large scale assessments is also an intimidating citation used to push back against innovation or improvement efforts. People claim that because a test has already gone through peer review, no processes or documentation can be improved. People claim that some innovations cannot be used or applied because they will never get through the peer review process. It’s just the same damn dynamic.

What I see over and over again is people who simply are against change, do not want to alter what they already know and are comfortable with, and are not invested in improving the product or process. But instead of going through an accurate analysis and/or giving real reasons to oppose change, they simply grasp for some powerful authority that they can claim is the unmovable obstacle. They do not have to own their opposition, and they certainly do not have to think deeply about evaluating the proposal. They do not even have to have power to reject the change. Instead, they claim some expertise about that other thing that is the real obstacle.

But I know what the ADA is. I know what goes in EULAs. SMJ taught me to actually read union contracts. And I even know enough about peer review to know that it is not the obstacle that it is made out to be. It was not meant to be an obstacle to improvement, and really doesn’t have to be. 

More generally, I do not think I am ever going to get over my fury when people try to prevent change by hand waving at intimidating-seeming authorities that they do not even understand. Again, it hardly matters whether they know better, because their ignorance is willful and the citation of authorities they know little about is intentional. It is just fear of change, fear of thoughtful deliberation and an unwillingness to take responsibility for maintaining their preferred (and problematic) status quo.


What We Mean When We Talk About ‘Reliability’


One of my little pet peeves is when athletes say that they need to be more consistent at something that they have long been consistently mediocre or bad at. I agree that they need to be better, but I don’t think that the problem has been consistency. Heck, a basketball player going from a 23% 3-point shooter to a 35% 3-point shooter has not even become more consistent, even though that would constitute a rather large improvement.

Words have meanings, and while I love metaphorical language, when words with rather precise meanings are expanded, our ability to express precise things is diminished. I find that frustrating—perhaps because I was raised by a lawyer and perhaps because I was such a math and science kid growing up.

But the fact is that words can have different meanings in different contexts. This is certainly true when words have technical meanings in expert fields and also have lay meanings for the general public.

Reliability is one of those words, and it is a very important technical word in the field of educational and psychological testing. And yet, it is also a middle school level word that refers to trustworthiness. 

In everyday use, a reliable person is someone you can trust to be there and to do the right thing. It is not just consistency, but also usefulness and worthiness. 

But the statistical term, as used in many technical fields, merely means consistency. Something can be consistently off by the same amount, and that would be reliable. Statistical reliability is only about consistency, regardless of appropriateness, precision, or actual accuracy. Under this definition, a car that only—but always—starts up when it is over 90 degrees outside is a reliable car. Moreover, it would be more reliable if it only started up when the temperature was over 100 degrees, and most reliable if it never started up at all. After all, that would be perfectly consistent—consistently useless.
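The gap is easy to demonstrate. A minimal sketch with simulated data (the score scale, noise level and bias are all made up): a measure that is always off by 40 points is every bit as “reliable,” statistically, as an unbiased one.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1,000 test takers with true scores on a familiar 100-point-ish scale.
true_scores = rng.normal(100, 15, size=1000)

form_1 = true_scores + rng.normal(0, 5, size=1000)         # unbiased measure
form_2 = true_scores + rng.normal(0, 5, size=1000) + 40    # always 40 points off

# "Reliability" here is just the correlation across the two measurements;
# the constant 40-point error never shows up in it.
print(round(np.corrcoef(form_1, form_2)[0, 1], 3))  # ~0.9
```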

This is a particularly important gap in meaning in my field because when psychometricians insist on maximizing reliability, that sounds really good to a lay audience that does not appreciate the difference between the everyday term and the statistical term. Psychometricians want consistency, even if it means consistently the wrong thing or leaving out the most important stuff because it hurts the consistency of the test. They say they are increasing reliability, and they are not lying. Heck, I am sympathetic to their use of the term, as it is what I tend to think the term means, too. That math kid started taking advanced Probability and Statistics courses nearly 40 years ago. That meaning is what I have in my head when basketball players talk about being a more consistent shooter.

I think a lot about important technical problems with how psychometricians’ focus on that statistical idea of reliability leads to worse tests, but perhaps the bigger problem is how their different meaning of reliability misinforms policy-makers and the broader public about what they are even trying to do. The broader public and policy makers are the real audience for our tests, and we should be mindful of how they hear what we are saying. 


Not Being Mind Readers, There Are Things We Cannot Know

We need cognitive tests because we cannot read people’s minds. Instead, we have to find evidence that supports various inferences we might want to make—or that refutes them. This dilemma of not being able to read minds is not limited to testing.

Hanlon’s Razor

Hanlon’s Razor advises, “Never attribute to malice that which is adequately explained by stupidity.” But the idea goes back much further. More than a century ago, H. G. Wells wrote, “There is very little deliberate wickedness in the world. The stupidity of our selfishness gives much the same results indeed, but in the ethical laboratory it shows a different nature.”

In my teen years, I thought that the worst trait a person could have was incompetence. Nope, I’ve never been fun at parties. I certainly have that history of seeing incompetence around me. But for the last few years, I have been faced with a situation outside of my professional life that I attribute to malice. Others whom I respect agree, but temper it with judgment of some amount of incompetence. Certainly, many people seem quite unwilling to see this particular form of malice.

How can I know? How can any of us know? We cannot see inside the hearts and minds of those around us, not even that one woman.

Optimizing Political Strategy

In the days immediately after Joe Biden’s disastrous performance in his 2024 debate with Donald Trump, I cautioned those around me that it would be foolish for Biden to drop out of the race before the RNC nominating convention. There would be no way to take away coverage from the RNC, and it would be wise to let Trump’s Republican Party waste its powder on attacks on Joe Biden and his age when they had maximum free coverage by the news media.

I cautioned that no one is going to remember a few weeks in July when we actually get to November. The DNC nominating convention was still over a month away, and there was not much in July or August that would matter by Election Day. Electoral campaign memories are short, perhaps unfortunately short.

I said that the optimal strategy would be to wait until…July 19 or 20 for Biden to drop out of the race. That would be weeks before the Democrats officially nominated their candidate. I didn’t want a traditional circular firing squad, and thought the best strategy would be to go with Kamala Harris—though she was not my preferred candidate in 2020.

Obviously, for this bait and switch strategy to work, there could not be leaks. Anyone who might leak anything to the media had to be ignorant of the plan. Thus, people in the know could not tell their aides—or perhaps even their spouses. For this to work, Joe Biden would have to look stubborn—even as the pressure mounted from people who did not know the plan.

I did not anticipate that Biden’s delay would build up such energy for replacement that a politician who produced so little excitement in 2019 and 2020 would be as well received as Vice President Harris’s candidacy has been. And I thought that Pete Buttigieg—Navy veteran, Rhodes Scholar, obviously conservative family man, comfortable debating Republicans on Fox News—would be a great running mate for her.

In fact, I was off by a day. President Biden dropped out of the 2024 presidential race on July 21. But otherwise…was my prediction wrong?

How can I know? How can anyone know? This had to be a no leaks plan. It would require Biden to look like an old grandfather who absolutely refuses to give up his car keys. He would have to take the further reputational hit in order to help his party to retain the White House.

What were Joe Biden and his closest most trusted advisors thinking? Could the greatest political strategist of this century, Nancy Pelosi, have come up with this weeks ago? Could President Biden have gathered the Clintons, Obamas and her for a serious strategy session? (Not Chuck Schumer. I do not trust that he would not leak.)

How could we possibly know the truth? Even if word leaked in the months or years ahead that this was planned, why should we trust that? It would make Democratic leadership look brilliant, so there is real motivation to leak such claims after Republicans cannot do anything about it. I do not know what to conclude, and I do not know that I ever will. (Even I do not think that the plan—the conspiracy—could go further back than the debate, but how can I be sure…?)

I cannot read anyone’s mind.

The Challenge of Intellectual Humility

I really try to be intellectually humble. I try to be aware of what I think and why. I try to be conscious of what I really know and the absolute facts available to me. I try to be mindful of when I run up the ladder of inference, even if it is just a single rung.

Yes, it would be validating for me to conclude that that woman is motivated by malice, rather than just stupid. Yes, it would be satisfying to me to think that our political leaders are brilliant, rather than just bumbling.

But I cannot read minds. I need to live with that uncertainty. At the same time, I need to look for whatever confirming and disconfirming evidence might be available—both professionally and in the other spheres of my life.


They Are All Norm-Based Tests, Brent

Track & field’s 100m sprint is a norm-based test. Though it is not a cognitive test, it exhibits so many of the causes and symptoms that make norm-based tests so problematic. It is a test designed to rank participants by giving arbitrary weights to a collection of related skills and then claiming a definitive result, in large part through the use of numbers. From the 100m sprint, we get a declaration at the Olympics of who is the World’s Fastest Man and Woman. But I just don’t buy it.

I have learned through the years that there are three main phases to the 100m sprint. First is the start, then the acceleration phase and finally…well, I see and hear it called different things. The constant speed phase. The maintain phase. Whatever. The name is not important. What is important is that different sprinters have different strengths. Sure, if you are the best in the world at all three of them you are going to win, but that is quite rarely the case. Sha'Carri Richardson is stronger at the third phase than the first phase, as was Carl Lewis.

When I was growing up, we did the 50 yard dash. The National Football League judges speed with a 40 yard dash. Indoor track has a 60m event. International Track & Field does not use a 100 yard race, but rather a 109.36 yard race (i.e., 100m). Why these differences? Ummmm…well, one could offer different reasons to support one distance or another, but there’s no definite best answer. It is arbitrary which one we should use or respect most. However, the longer the distance, the more important that third phase is, and the shorter the distance the more important the other two phases are.

When I was growing up, we did not get to use starting blocks. In fact, we had to begin from a standing start. Why prefer starting blocks or a standing start? There are reasons for each, even good reasons for each. One could go either way, so the decision is arbitrary.

This is no different than big math or reading tests. Math and reading are each made up of a large variety of skills. How much should the SATs or the ACTs depend on calculation skills? How much on solving algebraic equations? How much on making sense of word problems? How much on drawing graphs and how much on reading graphs? Obviously, there are more skills than that, and there’s no definitive reasoning for how we should weigh them in order to come up with a final singular score. 

Any test that offers a final singular score is intended to sort and rank test takers. This totally makes sense at the Olympics and other sports competitions. But it is just about useless when it comes to teaching and learning. A track coach is not going to learn anything about what to tell an athlete by looking at 100m times. It says nothing about what they are good at, what they are bad at, what mistakes they are making, or where they might most benefit from further instruction or practice.

Normative tests are good for the final competition and useless for teaching and learning. 

Break down the three phases of the race into separate times and the coach can use their expert knowledge and experience to zero in on what phase is the weakest. Allow the coach to actually see their work (i.e., watch the race) and they can break it down further and zero in on useful coaching. 
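A toy example, with made-up splits, shows why the composite hides exactly what a coach needs:

```python
# Two sprinters with identical 100m totals but very different phase profiles.
# (All splits are invented for illustration.)
sprinter_a = {"start": 1.85, "acceleration": 2.95, "maintain": 5.05}  # weak start
sprinter_b = {"start": 1.70, "acceleration": 3.10, "maintain": 5.05}  # weak acceleration

for name, splits in (("A", sprinter_a), ("B", sprinter_b)):
    print(name, f"total = {sum(splits.values()):.2f}s", splits)
# Both print "total = 9.85s" -- the ranking sees no difference, but a coach
# would train sprinter A on the start and sprinter B on the acceleration phase.
```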

But you can’t crown the World’s Fastest Man (or Woman) if you break it down like that. Which phase counts most? Should we focus on top speed? Best time over their fastest 20m? Fastest acceleration? Should the maintain phase be like 20m, or like 70m? 

Any test that reports a singular score is meant to sort and rank students. We cannot do that with profiles of proficiency across an array of different skills, but that is a whole different kind of test. And yet, those profiles are what are useful for teachers and students, what is useful for teaching and learning.

Unfortunately, while not all standardized tests are normative tests reporting a score of some arbitrary composite of different skills, almost all of them are. And that will always be a problem, perhaps even an obstacle to the process of education.
