Inclusion in Assessment Development: Making Use of Diversity

It should not be hard to understand the meaning of inclusion in assessment development, as so many of us have been classroom educators. 

For classroom educators, inclusion means including special education students in regular classrooms, lessons and activities—rather than merely placing them in the building, but in self-contained classrooms. It is about including those students where the main action is, rather than marginalizing them over there in some other part of the building.

This same logic applies in the workplace. It is not enough to merely include diversity in the organization if it is marginalized over there. It’s not enough that it is listed on paper as being part of the team, but not in the room where issues are discussed. It is not enough if it is not at the table where decisions are made. 

Inclusion is about actually taking advantage of the potential of diversity on our teams to help our projects and our products. 

Obviously, this matters quite a bit when it comes to writing assessments for the diverse range of test takers who take our products. If our diverse voices cannot be heard appropriately, then the promise of enlisting them in the first place goes unmet. I would suggest that disciplinary lens is another dimension of diversity that should be acknowledged and considered in the context of inclusion. Discussions of issues and decisions need to be open to those diverse voices, or else their knowledge and potential contributions will be wasted, and our products will suffer. 

I suppose that this is an aspect of balancing confidence and humility, of knowing when to listen—which requires ensuring that those other voices are present for discussion and decisions. If we did not have a history of marginalizing some voices and excluding some perspectives from the room and table, this would not be notable. But we do have those histories, so we need to be careful to break those old patterns and establish new norms for how we ensure that our product (and decisions) are able to leverage the potential of the diversity within our teams. 

Because of longstanding norms of power and who is centered, this requires intentional effort and attention to ensure that those voices are truly included. Because this is about cultural norms and power—yes, it really is about power—efforts to truly include those voices and perspectives take more work and more difficult work than those who have always been included realize. It takes more than merely literally including people in the room and having them at the table. It takes the work of giving them the confidence to speak up and the work of giving the others the humility to listen. 

But it is all worth doing because it produces better products that have a better chance of being put to some valid use and/or purpose.

Is It Time to Just Ignore NAEP?

Should we be paying this much attention to NAEP? I don’t think so.

Differing Standards

Are you an expert in anything? What do you think the important knowledge and skills in that topic are? Could you make a list of them—an organized and detailed enough list to guide years of instruction on that topic?

Let’s imagine cooking. Here are some questions you’d need to figure out:

How important is baking? How much might you want to focus on the skills and knowledge of baking breads? Cookies? 

Roasting? (What is the difference between roasting and baking, anyway?) What are all the important skills of roasting meat? Roasting vegetables? Roasting pastas—or is that baking?

Grilling? Is that the same as barbecuing? What are the important skills and knowledge there? Still gotta cover gas, charcoal and wood? 

What are the important skills and knowledge around salads? What is a salad, anyway? 

Old school skills: aspic? Jello mold? What about them?

What about principles of healthy cooking? What are those? Are they worth including? In what year? What are the skills and knowledge?

Blooming spices? Is that on your list? Should it be? 

Reusability of parchment paper? How to clean cast iron pans? How to season them? How to clean a blender properly? How to make clear ice cubes?

The thing is, my year by year list of critical knowledge, skills and abilities would be different from my co-author’s, different from my wife’s, and different from yours. Two really good lists could still differ significantly—even radically.

We used to have more variation across the states when it comes to state learning standards for math and reading. We have less variation today, in large part because of the widespread adoption of the Common Core State Standards (CCSS). One would think that that eases the problem.

NAEP is Not Aligned to Common Core

The problem is that the widespread influence of CCSS has not really reached NAEP. For NAEP to align to Common Core would break comparability over time. Measure something else, even something only moderately different, and you should not compare the old scores to the new scores. All those longitudinal sequences would break—and longitudinal sequences are a big raison d'être of NAEP.

There is a lot that I really like about NAEP. It is a very well designed, produced and implemented assessment. It’s just testing the wrong thing.

But it does not measure what teachers are told to teach.

Does it generally measure the right things? Sure. Generally. But not exactly the right thing. It’s kinda measuring the wrong thing. Far from totally wrong, but not really the right thing. 

Like my wife’s sister, or maybe her identical twin sister. If my wife and her (fictional, btw) identical twin sister were raised in the same family and took all the same classes, would it be ok to only test her sister and then say that the scores and grades applied to my wife? Would that be accurate? How far off do you think the scores might be?

NAEP is measuring over there, but teachers are told to teach over here. They are close, but they are not the same. So, how much can we trust NAEP to reflect the real state of learning and proficiency of our students?

There is No Good News Here

There is no way to spin the latest NAEP results (math, reading) as good. The downward trends are concerning. But frankly, I have no idea the extent to which they merely represent divergence between CCSS and NAEP’s own alignment references. None. And I’ve seen few serious efforts to figure that out. The fact that the downward trends predate COVID, but postdate the widespread adoption of CCSS really concerns me. 

But down is down. It is not up. NAEP’s measured constructs are similar enough to CCSS’s that I would hope to see increases in NAEP, even if they are attenuated from the actual learning and proficiencies of students. The problem is that it is quite easy to imagine that more and more refined efforts to address CCSS’s versions of the constructs result in more drift from NAEP’s foci. 

Nonetheless, the trends are national in scope. Virtually across the board. That ain’t good. Perhaps it is just a reflection of testing the wrong thing, but there’s no good evidence here, no good story to be told. 

I care most about making sure that the tests actually measure what they claim to measure—well, other than caring about the learning, development and health of children, of course. I think there is a broad crisis of standardized tests misrepresenting what they actually mean to an audience that is hungry for particular meanings from those tests. If the tests are not measuring the right thing, the usefulness of the entire endeavor is questionable. The coverage we see of NAEP results does not account for this, which is perhaps evidence that we should not be paying any attention at all to them. If we can’t get it right with NAEP, what hope is there for other assessments?

When state standards were so varied, NAEP offered a common yardstick to judge them all against. But the NAEP team’s conclusions on what to measure turned out different from the National Governors Association's and Council of Chief State School Officers’ teams' conclusions about what to teach. So, my biggest concern is that instead of supplementing other assessments with its own special strengths, NAEP is arrogantly sticking to its own construct definitions. 

There is a path forward for NAEP. In a country without federal power over standards or curriculum, NAEP should acknowledge the hard work of states and their leadership—and the goals of schools and teachers. Then, we might actually get more value from it. 

Diversity in Large Scale Assessment Development

Standardized tests try to assess specific knowledge, skills and/or abilities (KSAs), but quite often cannot do that directly. First, they cannot actually read minds. Second, many of these KSAs need to be observed in some sort of use, as opposed to as purely isolated skills. Math standards specifically call for applying KSAs “in context,” meaning word problems built of little stories. 21st century science assessments often take some scientific phenomenon and describe some real world—perhaps everyday sort of—scenario that test takers need to analyze to recognize and apply the science KSAs. ELA reading passages are also set in contexts.

Understanding any of these requires knowledge of context and culture. What kind of language is appropriate to use? What background knowledge do test takers have? What examples are easy to recognize and unpack, and which take more work? 

But we are an incredibly diverse nation, with children growing up in different contexts, and therefore with different background or common knowledge. 

Some know what a mensch is, and some know what collards are.

Some know that gravy is brown and goes on meat and mashed potatoes, and some know that gravy is red and goes on pasta. But some know that it is white and goes on biscuits.

Some know what a cul de sac is, and some know what a (building) super is. 

Some have a Nana, some have a Gram, and some have a Grandma.

Some grow up playing in woods and creeks, and some grow up around turnstiles and transfers.

Some had back yards and others had front stoops. Some know the differences between a porch and a deck.

There are so many dimensions of diversity, rather few of which we actually capture in the official records of demographics. I knew that not all doctors are medical doctors. I had a back yard, behind which were woods. I had a Nana. But I didn’t really know the difference between porch and deck or what a stoop was. 

The most defining characteristic of large scale standardized assessments—be they K-12 assessments or professional licensure exams—is that they are given to incredibly diverse ranges of test takers. This is supposed to be true of the sorts of psychological exams that I do not work on, as well. 

Thus, because the testing population is so diverse, it is absolutely vital that those who create and evaluate tests understand the range of perspectives and experiences among test takers. Perhaps not individually—as that is an awful lot for any one person to know—but at least as a team. It is important that individual test developers and evaluators continue to broaden and deepen their understandings of the test taking population, and therefore that they have people to learn from. That is, the work simply requires diverse teams so that the even more diverse testing population can be anticipated. 

Otherwise, we can only develop tests that assume the kinds of knowledge and understandings that we ourselves had at those points in our lives, without appreciating how that construct irrelevant knowledge and understanding acts as a barrier to other sorts of test takers’ ability to demonstrate what they can and cannot do, what they do and do not know. 

Setting aside any moral obligations to employees or potential employees—really, just set that aside—we cannot develop effective assessments without diverse teams. We need test developers with diverse backgrounds themselves, and experience working with an even broader range of test takers. The lack of diversity among test developers has long been a weakness that undermines the validity of any use or purpose for our assessment products. We need to do better, not retrench into the worst habits of the past.

The Supremacy Clause and Executive Orders

The United States Constitution. Federal law. Federal regulations. Executive orders. Individual discretion. State and local authorities of various sorts. In our system of the rule of law and federalism, there is a clear hierarchy among them.

I. The United States Constitution

The US Constitution (1789) is supreme. No law or regulation or government action is allowed to defy the dictates of the US Constitution. This is why unconstitutional is a death sentence for any government action. On the other hand, the Declaration of Independence (1776) has no authority. It was a piece of rhetoric, an announcement and justification of rebellion (e.g., “He has plundered our seas, ravaged our coasts, burnt our towns, and destroyed the lives of our people”). It was our nation’s founding document, but it was not our government’s founding document. There were the old Articles of Confederation, and those were replaced by the US Constitution. The Constitution changed the nature of our government—state and federal. It is the ultimate authority.

Amendments to the Constitution are part of the Constitution. They change the contents and meaning of the Constitution. They have higher standing within the Constitution than the original text, as they come later. Though difficult to do, anything in the Constitution can be changed or overridden by constitutional amendment—but not by any other mechanism or authority.

Military and civilian officers in our government swear an oath to uphold our constitution. The US Constitution is the basis for our whole government.

II. Federal Laws

Though they may not violate or override the Constitution, federal laws—passed by Congress and either signed by the president or passed through Congressional override of a veto—are superior to everything else. In fact, this is stated explicitly in the Constitution, in the Supremacy Clause (Article VI, Clause 2).

This Constitution, and the Laws of the United States which shall be made in Pursuance thereof; and all Treaties made, or which shall be made, under the Authority of the United States, shall be the supreme Law of the Land; and the Judges in every State shall be bound thereby, any Thing in the Constitution or Laws of any State to the Contrary notwithstanding.

Laws require either the cooperation of both houses of Congress and the President or an overwhelming 2/3 vote of each house of Congress. Treaties begin with the President and then go to the Senate for ratification. Federal laws must overcome our system of checks and balances, but once they do, they are incredibly powerful.

Laws are necessary because the Constitution cannot predict everything or settle things in sufficient detail. For example, the Constitution says we need a system of federal courts and there will be a Supreme Court. But how many other courts, how many justices, and how will it all work? Well, that was set up in federal laws, and is subject to alteration by federal law. The greater ease of making laws allows the government to respond to the needs of the nation without resorting to the far greater hurdle of constitutional amendment, and to correct previous actions as well.

III. Federal Regulation

Just as the Constitution cannot anticipate and describe everything in sufficient detail to handle our needs, federal laws also are limited. They need more detail and explanation than Congress can provide—and perhaps ought to provide. Thus, we have a system of federal regulation that fills out those details. For example, what counts as pollution and what are the acceptable levels?

There is a process laid out by Congress for how federal regulations are made, in the Administrative Procedure Act (APA) of 1946. Thus the process by which the executive branch under the President passes federal regulations was crafted by Congress. Again, checks and balances.

The APA requires delay in crafting regulations. Drafts must be published publicly, and time given for the public to comment on them. There also are procedures for rescinding longstanding regulations, and enacting and rescinding federal regulations can take over a year. However, new regulations can be overturned by either the President or Congress. Checks and balances, and making changes takes cooperation between the executive and the legislature—at least tacitly.

IV. Executive Action and Executive Orders

The executive branch of the federal government comprises virtually every agency, department and office that actually does or enforces anything. Almost any agency, department or office that you have heard of is part of the executive branch, and therefore under the control of the President. In some cases, the President merely appoints its leaders; in most cases, the President can remove them. Heck, in most cases, the President can give them orders—though not all.

But the government is far too big for the President to micromanage every action and decision. Heck, a single store or restaurant is usually too big for its leader to micromanage every action and decision. A small chain adds layers of central leadership over branch leadership. And a larger company or conglomerate adds more layers, still. The President has to rely on layers of leadership and management and line employees to actually get anything done, be it inspecting a meat processing plant, stamping a passport, collecting a tariff or charging a hill. All of those workers at all levels should act in accordance with the Constitution, federal laws and federal regulations, but there is still leeway, discretion and decision-making.

However, discretion is the doorway to discrimination, inconsistency and unpredictability. So, most departments, agencies and offices have internal guidance or guidelines that they publish. This provides transparency to whomever is interested or potentially impacted, and addresses those issues of inconsistency and discrimination. Of course, all of this must be consistent with the Constitution, federal law and regulations.

Sometimes, the President might want to set his own guidance or guidelines. That is what an executive order is. There might be questions of priorities or timing. There is all kinds of wiggle room and various sorts of decisions that are not set in law or regulations. Anyone who has been responsible for implementing a plan knows that there are always a host of little decisions that add up to making a huge difference in how it works out—little things that were not accounted for in the plan. Sometimes, they were merely overlooked, and other times the planners thought it best to leave the decision until the exact particulars were known. That means leaving it in the hands of the folks doing the work at the time, instead of tying their hands in advance.

The more grey areas in the law or regulations, or the more contradictions within or between different laws (or different regulations), the more room there is for executive orders. But they must be consistent with the Constitution, federal law and even federal regulation. That is what it means to be a system of law, to have the rule of law. Individuals—even the President of the United States—are bound by the law. This includes being bound by the Administrative Procedure Act when creating and rescinding regulations. No president can issue orders with the force of law, or even federal regulation. No president can override federal law unilaterally, and certainly cannot override the Constitution. Just as the President must spend the money budgeted by Congress on the uses laid out by Congress, the President cannot simply erase laws or create new ones by fiat.

Use of executive orders can be quite controversial. When there is too much to address, the president might set priorities. For example, given the limited capacity of our immigration courts, the President might prioritize the removal of serious criminals, recent immigrants or less recent immigrants. If Congress increased the resources available for our immigration courts, then everyone in our country in violation of our laws could be processed and adjudicated. But without those resources, there must be some decisions made about where to direct them. Such decisions can be controversial. But this is really no different than a local police department deciding where to position officers and which crimes to focus on. For example, which speeding or shoplifting offenses to let go, or how many resources to devote to solving a particular burglary. Many executive orders seem obviously good and righteous to some, even as they seem obviously bad and immoral to others. They are a set of consequences of democracy and election outcomes. So long as they are consistent with the Constitution and federal law (and even federal regulation), they are within the power of the President.

V. Recent Limitations on Executive Authority

In fact, the discretion of executive branch officials has recently been diminished by the US Supreme Court. Since last year’s Loper Bright Enterprises v. Raimondo ruling, our federal courts have been instructed to give less deference to regulatory and guidance decisions from executive branch officials. The courts have also, in recent years, limited the executive branch from making regulations that are literally consistent with federal law and the Constitution when they judge that an issue is so important that Congress must be actively involved in deciding it—as opposed to delegating it to the executive branch.

That is, more authority for Congress and the Courts, and less for the executive branch.

VI. What About States?

States can layer on top of federal law, but cannot override it. So long as the Constitution allows the federal government to take action or regulate an area, federal law is supreme. Federal law can allow states to add, but does not always do so. There is a federal minimum wage, but states can set higher minimum wages. The FDA approves drugs for sale nationally (and under what sorts of labelling and marketing), but states cannot approve drugs in addition to those approved by the FDA.

Even federal regulations, which are enacted under the auspices of federal law and serve to clarify how the federal law will be enforced and interpreted, are supreme to state law and state constitutions. That is, in areas covered by federal law, federal law and regulation decide whether states can add anything additional, or whether federal law (and/or regulations) simply preempts state efforts.

Of course, to the extent that the executive branch has less authority to pass regulations, issue guidance or executive orders, that might give states more leeway.

VII. Really?

Well, that is how it worked last week. It is how it has worked for centuries, decades, years and months (as laid out above).

Hopefully, we remain a nation of laws.

The Original Sin of Large Scale Educational Assessment

The Standards for Educational and Psychological Testing explain five “sources of validity evidence,” on pages 13-21.

  • Evidence Based on Test Content

  • Evidence Based on Response Processes

  • Evidence Based on Internal Structure

  • Evidence Based on Relations to Other Variables

  • Evidence for Validity and Consequences of Testing 

Only one of these is really about even moderately sophisticated psychometrics: Evidence Based on Internal Structure. The others are either content based or rely on other sorts of statistical techniques. But evidence based on internal structure gets at some real issues in psychometrics. It is easy to understand, as it has the shortest explanation of the five potential sources of validity evidence. For example, the first of its three paragraphs says:

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity (p. 16).

And yet, the practice of developing, administering and reporting large scale standardized educational assessment seems to have mostly abandoned this form of validity evidence—the only form that really gets at psychometric issues. 

Straightforward examination of domain models (e.g., state learning standards) immediately reveals that these tests are supposed to measure multi-dimensional constructs. Those who know the constructs and content areas best are quite clear that these constructs (i.e., content areas) are multidimensional, with different students doing better in some areas and worse in others. They require an array of different sorts of lessons and ought to be measured with an array of different sorts of questions. 

I was taught that this kind of psychometric analysis is really about factor analysis of some sort: which items tend to load onto which factors—dimensions—followed by qualitative, content-based analysis to confirm that this is as it should be. Heck, the basic question of whether the hypothesized dimensionality of the construct is reflected in the empirical dimensionality of the instrument…well, I was taught that that is really important. And The Standards seem to say that, too. 

But instead of ensuring that the dimensionality of the instrument matches the dimensionality of the domain model, the dominant mode in large scale educational assessment has an almost knee-jerk reliance on unidimensional models. Heck, items that fail to conform to this demand are discarded, as model fit statistics are the ultimate determinant of whether they can be included on a test (form). Such statistics are used to ensure that the dimensionality of the instruments does not match that of the construct. 

This use of such statistics combines with the use of unidimensional models to ensure that tests are not valid, by design. It ensures that domain models will be reread, reinterpreted and selected from only insofar as they can support the psychometric model. The tail wags the dog. 

There are many issues with large scale assessment that cause educators, learners, parents and the public to view them as “the enemy,” as Steve Sireci observed in his 2020 NCME Presidential Address. But if I had to pick the single most important one, this would be it. Multiple choice items are problematic, but it quite often is possible to write good multiple choice items that i) reflect the content of the domain model, ii) prompt appropriate response processes, iii) combine for an internal structure that resembles that of the domain model, iv) combine to have appropriate relationships to other variables, and v) support appropriate inferences and consequences. But none of those are possible while insisting that items and tests are not allowed to match the structure of the domain model. This is not simply about ignoring the domain model, as some sort of neglect. Rather, this is active hostility that affirmatively bars using it as the primary reference for test development.

Looking for DIF or other invariances that suggest fairness issues is not enough, so long as the structure of the domain model itself is barred from properly influencing test construction, as The Standards say it should.

To state this more plainly, this practice sets psychometric considerations as the main obstacle to developing valid tests—or tests that can be put to any valid use or purpose.

Difficulty and Rigor

“Difficult” is a problematic word, in the world of teaching, learning and assessment. It refers to many related ideas, but because they all use the same word people often conflate them. (I just learned this year that that is called a jingle fallacy.)

As a former high school teacher, I can think of a whole bunch of different meanings that I might have intended from time to time.

  • Something is difficult if it takes a long time to do.

  • Something is difficult if it is unlikely to be successful.

  • Something is difficult if most people—or at least very many people—are likely to fail at it.

  • Something is difficult if I, as a teacher, have to spend more time teaching it in order for my students to develop proficiency.

  • Something is difficult if it marks such a change from earlier lessons that I need to work hard to teach my students to adopt a different mindset for it than they’ve had before.

  • Something is difficult if it is easily mistaken for something else, and therefore likely to be attempted with the wrong tools.

  • Something is difficult if a person simply does not have any experience with it.

  • Something is difficult if someone has never had to do it themself before.

  • Something is difficult if few people will ever be able to do it.

  • Something is difficult if the path to gaining proficiency is quite long.

  • Something is difficult if precursors are not taken care of.

This incomplete list contains many different ideas, some of which overlap, some of which address radically different aspects of difficulty than others. Some of them might be viewed as contributors to difficulty and some as different conceptions of difficulty.

Large scale assessment (LSA) has a very particular idea of difficulty. In this context, difficulty is measured empirically. It has nothing to do with teaching effort or learning paths. Rather, it is simply the share of test takers who responded to an item correctly. Concepts and lessons do not have difficulty, just individual items. Because seemingly minor alterations to an item can radically alter how many test takers answer successfully, difficulty must be measured through field testing and monitored through operational use of an item.
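To make that concrete (a minimal sketch with invented data, not any operational scoring system), classical empirical difficulty is just the proportion of test takers who answered each item correctly, sometimes called the item's p-value:

```python
# Minimal sketch of classical empirical item difficulty ("p-values").
# Rows are test takers, columns are items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]

def item_difficulty(matrix):
    """Return the proportion correct for each item (column)."""
    n_takers = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_takers for j in range(n_items)]

p_values = item_difficulty(responses)
print(p_values)  # [0.75, 0.75, 0.25, 0.75]
```

Note the inversion in the naming convention: a higher p-value means an easier item (more test takers got it right), which is why these figures have to be re-estimated through field testing whenever an item is altered.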

In a teaching and learning context, however, this empirical difficulty is not actually a fixed attribute of an item. It is the result of the efforts of teachers and students. When teachers spend more time on some difficult content or lesson—or perhaps come up with a great new way to teach it—students can become more successful learners and then more successful test takers.

Ross Markle explains that educational (and other) interventions can undermine the psychometric properties of tests (and items), including when those interventions are prompted by those tests. An item might be quite difficult (empirically) one year, but because teachers respond to that with new efforts, test takers might be much more successful the next year. Other items like it might be empirically much less difficult in later years, perhaps to the surprise of test developers.

Dan Koretz has long pointed to how teachers respond to test items by altering their teaching to better prepare students for precisely those approaches to assessing content. Alternatively, one reasonable use of LSAs is to evaluate whether curricula support the development of targeted proficiencies. Thus, this kind of feedback loop can undermine teaching and learning, and it can support teaching and learning.

(Of course, all of this violates psychometric assumptions of unidimensionality, but that’s neither here nor there.)

From time to time, there is great talk about needing more rigor in school and especially in assessment. That’s the word that people use, “rigor.” But we think that people really mean difficulty. And we think they mean empirical difficulty. They want harder tests that produce more failing results. They appear to mean that a better test is one that produces lower scores.

We think that that is garbage. We think that a better test is one that better reflects the actual proficiencies of test takers with elements of the domain model—such as the state learning standards for that subject and grade.

And frankly, we do not think that rigor is about empirical difficulty or conceptual difficulty. We do not approve of using “rigor” as a euphemism for “hard.” Rather, we think that rigorousness is something like:

  • extremely thorough, exhaustive, or accurate;

  • strictly applied or adhered to;

  • severely exact or accurate; precise.

Perhaps large scale assessments should be rigorous. They certainly should be accurate. We would prefer that they be exhaustive, but we do not see that they are given enough testing time and opportunities for that. Still, those seem like reasonable goals.

Their empirical difficulty should be driven by the nature of the content they are supposed to reflect and the teaching and learning of that content. It should not be a goal of the vindictive, nor should it be targeted by psychometric considerations that are not deeply grounded in issues of content, of teaching and of learning.

Certainly, however, test development should be rigorous. Our test development processes should be demanding, and our own professional standards should be strictly adhered to. That is where we would—and do—apply the word “rigor.”

An Economist at Thanksgiving

I had an economist at my Thanksgiving table last week.

That’s not a joke or punchline. That’s just a fact. She doesn’t seem like an economist. She is remarkably charming, warm and personable. So, I usually don’t think of her as being an economist.

What do I mean by that, by “being an economist”? Well, in my experience, the field of economics suffers from the greatest degree of disciplinary arrogance of any field. Economists appear to be trained and acculturated into the belief that their toolbox is so outstanding that there is no need to learn—or even respect—the substantive expertise of those in other disciplines. Disciplinary arrogance.

Now, I am not what I would consider highly expert in school dress code policies. I have been following the issue for 30-40 years. I occasionally read professional research articles on the topic. I follow news links on new developments or examples. But I have not done my own original research or any sort of exhaustive review of the literature—though I sure have read more than one literature review from others over the decades. School dress codes and uniforms are one of those areas of policy and policy implementation that run right into my own beliefs about adolescent identity and social development. So, this topic brings a few professional interests together, even though I do not actually focus my own work on it.

But the economist has a different relationship to this topic. For some second-hand personal reason, she did (or trusted) a little bit of internet research on the topic and was highly confident that she understood the subtleties and complexities of how the law interacted with a particular charter school’s uniform policy implementation. She thought she understood what the New York Board of Regents had required of public schools.

It was not a public school. It was a charter school, but she did not understand that charter schools are not public schools. (When I pointed out that they are simply privatized provision of traditionally publicly delivered services, she immediately got the point. She’s an economist, so she was already familiar with that dynamic, though she had not recognized it herself.) It was not the Board of Regents, but instead a new policy from the city’s Department of Education prompted by a New York City Council bill.

Most importantly, she did not understand the substance of the DOE guidelines. She did not know the history of the issues, the problems that the new guidelines were specifically designed to address, or the significance of the particular language within the guidelines (e.g., “distracting”). She thought that she understood it all, having read the guidelines and applied her own knowledge and thinking to the document. She thought that her reading was accurate and her conclusions correct. She insisted on it. She refused to accept that she could have been wrong.

But expertise matters. Which means that intellectual and disciplinary humility matter. While playing in someone else’s backyard—an expression I think I learned from Andrew Gelman—can be fun, it is important to listen and to be mindful of the limits of one’s own expertise. It is important to learn from those with greater expertise, usually by asking questions. This requires acknowledging the expertise of others.

Which, unfortunately, is something that economists generally are not very good at.

The Overarching Goal of Schooling

For my first master’s degree, I had to do a master’s thesis.

I wrote about the importance of a school having a purpose, or what I called “a school-wide working philosophy.” Because I graduated from college certified to teach (i.e., back in the 20th century), this thesis came ten years after I began my teaching career. I had worked in enough schools to have seen some stuff.

I was not particular about the purpose. Rather, I was concerned that when different teachers and different programs in a school are aimed at different purposes, they undermine each other. For example, athletics can undermine academic classes when coaches pressure teachers to raise a student’s grade so that they meet grade requirements for eligibility to play. Traveling debate teams can undermine academic classes, too; when any extra-curricular activity pulls students out of class for a road trip of any sort, students miss lessons for their academic classes.

Test scores and deep content lessons are not generally aligned. Core academic lessons and developing an outstanding college application are not necessarily aligned. Social development is something else entirely.

Today, decades later, I still believe in program alignment. I still believe that schools should be clear about their priorities and what they are trying to do for students. I still believe that the hodgepodge of different programs with different allegiances among students and among educators is dangerous.

But I think that today I am more accepting of the existence of multiple goals. This makes the alignment problem even more challenging. This makes leaders’ roles even more important, as they try to build and manage a school or system that supports multiple goals without them undermining each other. The decisions are tougher, not easier, this way. And leaders still need to say “No” sometimes.

Nonetheless, I still believe in an ultimate and overarching goal for our schools. I do not believe that schools exist primarily to deliver some benefit(s) to individual students. That is not a good enough reason to require all children to attend or to tax everyone to pay for this enormously expensive endeavor. Rather, schools exist to serve communities.

Therefore, whatever programs or purposes schools might have, they should serve the community. Is students’ social development important? Surely. Preparation for citizenship? What could be more important than that? Preparation for economic contributions? Yes, that is important, too.

This is my lens. I think this is always my lens: the ultimate goal of schools and schooling. The overarching goal. Not merely a justification others try to hang things from, but rather a value with which to judge efforts in and around schooling. How can this school and its program support the communities it serves?

Therefore, schools must be under the control of communities. Strong influence from the local community. Influence from the regional (e.g., state) community. And so long as we are such an interconnected country, influence from the national level, as well. They must be communal affairs, and not perverted into serving private individual interests.

We already have private schools that are designed to serve individual interests. No, they are not good for communities. They are not good for democracy. They are not good for a pluralistic nation, region or community. Therefore, it is even more important that our public schools be preserved and supported. Therefore, it is vital that we protect them from privatizing interests.

The Road from Validity (Part II): Cementing the Replacement of Validity

Last month, I explained how large scale human scoring departs from carefully constructed rubrics and procedures, replacing them with something different. The desire for reliability, and the procedures that favor reliable scorers (i.e., those most likely to agree with other raters), serve to push out those who use the valid procedures and favor those who use the faster ones.

Of course, we no longer live in an age of large scale human scoring of test taker work product. We have replaced most human scoring with automated scoring. There are algorithms and artificial intelligence engines of various sorts that do the scoring instead. They are trained on examples of human scoring, but the bulk of the work is done by automated scoring.

How do we know whether the automated scoring engines are any good? Well, their creators, backers and proponents proclaim again and again that they do what human scorers do, only better. Not just faster and cheaper, but better.

What does better mean? Well, they use the same idea of reliability to judge the automated engines as they do to judge human scorers. They make sure that the automated scoring engines have high agreement rates with the pool of human scorers on some common pool of essays. That sounds fair, right? They use the same procedures that are used to judge human scorers—and to decide who is invited back—to judge the machines.

This means that the same dynamic that replaced validity with reliability—that replaced the valid scoring procedure with the faster scoring procedure—is used to tune the algorithm.

No, automated scoring is not better at using the valid procedure—not faster, cheaper or more diligent. No, it means that automated scoring is better at using the faster procedure.

Of course, once the faster procedure is coded into the algorithm, it is harder to question. This is yet another form of algorithmic bias. Algorithms trained on biased data will perpetuate those biases in perpetuity, rather than highlight them. In this case, the algorithm simply repeats the biases that make up the deviations from the valid procedure. Whatever was harder or more nuanced in the valid procedure—whatever made it slower—is left out.

No, machines do not make mistakes. They do what they are told to do. The human scorers who depart from the valid (and slower) scoring procedures are making mistakes. But the machines are simply doing what they are told to do—trying to match the results of the human scorers.

And their developers and proponents brag on that, as though doing what they are told to do is necessarily doing the right thing. They fail to audit their training data (i.e., the results of human scoring) for validity, trusting in their beloved reliability. So, their algorithms cement the biases and departures from validity and hide them behind claims of reliability—as though reliability is the same thing as validity.

Not intentionally. Perhaps not quite inevitably. But quite consistently. Quite reliably.

The Road from Validity (Part I): How Reliability Replaces Validity


Imagine that you have some complex multidimensional trait that you want to test for. 

A. You get a bunch of experts and ask them to rate the recorded performances of test takers…only they don’t agree on scores. They rate the performances inconsistently because they do not agree on how to weight the different contributing factors and dimensions. They are all experts, so they are not wrong, but you judge the inconsistency a problem.

B. So, you train them. You provide them rubrics and exemplars. You tell them how they should be scoring. You try to standardize their judgments. 

C. Then, you want to remove the outlier scorers and keep the ones who are consistent with each other. You want comparable scores, so you can explain what the scores actually mean, what they refer to. You don’t want that mess of inconsistency, which you believe does not actually help anyone. 

To ensure that test takers are rated fairly and outlier scorers are caught and removed, you have every performance rated by two different scorers, selected at random. If they agree, that’s good. If they disagree, that’s bad. Over the course of a day or a week, each scorer rates a whole bunch of performances, so you can check their overall rate of agreeing with their randomly selected scoring peers. The ones who tend to agree more are the ones you keep, and the ones who seem to march to their own beat more often are not invited back.


D. You are actually running a business or have some sort of budget. You want the scoring done as quickly as practicable. And the scorers? This job is rather repetitive and boring, so they don’t focus as well on each individual performance when they are rating so many of them. These are two different pressures for speed, both top-down and bottom-up. Some scorers go faster, taking little shortcuts and using their own heuristics in place of the full formal procedures laid out in step B.

Is that real scoring expertise at work? Is that developing skill? Or is it the replacement of the official procedures with something else that is a bit faster?

E. Let’s look more closely at what happens to the agreement rates. We’ll use easy round numbers to make the analysis simpler. Imagine that half of the scorers take that faster route, and half go by the officially sanctioned formal route. And imagine that the faster route is twice as fast. That means that the faster group is going to score twice as many performances as the slower group. That means that everyone’s randomly selected scoring partner is twice as likely to be a fast scorer as a formal scorer.

In our little simplified thought experiment, these two groups are using two different methods of scoring, and they will sometimes differ in the scores they report. Whether I am a fast scorer or a formal scorer, 2/3 of the time I will be randomly paired with a fast scorer. I will have better agreement rates if I use the fast scoring method than the formal scoring method. The fast scorers will have higher agreement rates than the formal scorers.

The fast scorers will be retained and the formal scorers will be replaced, and some of the replacements will opt for the faster method. This will further lower the formal scorers’ agreement rates and further increase the fast scorers’ agreement rates.
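This dynamic is easy to simulate. The sketch below is a minimal illustration, not a model of any real scoring operation: it assumes that scorers using the same method always agree, that cross-method pairs agree only 60% of the time (both numbers are invented for illustration), and that fast scorers fill two of every three scoring slots because they work twice as fast.

```python
import random

random.seed(0)

N_PAIRINGS = 100_000
CROSS_AGREE = 0.6  # invented: chance a fast/formal pair agrees on a score

# Fast scorers work twice as fast, so they fill 2 of every 3 scoring slots.
POOL = ["fast", "fast", "formal"]

agreements = {"fast": 0, "formal": 0}
ratings = {"fast": 0, "formal": 0}

for _ in range(N_PAIRINGS):
    a, b = random.choice(POOL), random.choice(POOL)
    # Same-method pairs always agree (a simplification); mixed pairs sometimes do.
    agreed = (a == b) or (random.random() < CROSS_AGREE)
    for method in (a, b):
        ratings[method] += 1
        agreements[method] += agreed

for method in ("fast", "formal"):
    print(f"{method:6s} agreement rate: {agreements[method] / ratings[method]:.3f}")
```

Under these assumptions, the fast scorers land near 0.87 (2/3 + 1/3 × 0.6) and the formal scorers near 0.73 (1/3 + 2/3 × 0.6), so a retention rule based on agreement rates systematically culls the scorers following the documented procedure—even though nothing about the fast scores is more valid.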

*********************

What has happened to your construct? What are the test takers performances being judged against? Would you even notice the shift?

Would you notice that the use of reliability to evaluate scorers drove a shift from the formal, documented scoring procedures that were designed to best evaluate the construct to some—perhaps obvious—shortcuts that do not consider all of the dimensions and subtlety of the construct?

The Unacknowledged Tyranny of the Platinum Standard

I just got back from an educational research conference, and as is my wont, I had a lot of conversations about assessment and educational measurement.

On the morning of the last day of the conference, as people were saying their goodbyes, I found myself in conversation with a brilliant young psychometrician still on her first job in industry. I was pushing her to consider examining the application of some sort of multi-dimensional psychometric model when she got a chance to do her next little research project. She was concerned that that might mark her as being a little weird, as the industry is so heavily invested in unidimensional psychometric models. She pulled in Yu Bao, a professor in James Madison University’s Assessment & Measurement program, who was walking by. Yu agreed with me that there’s a lot of room for a psychometrician to make their name with multi-dimensional models.

I went on one of my typical rants about the mismatch between unidimensional psychometric models and multi-dimensional domain models and the platinum standard. That is, the way that psychometricians bring model fit statistics to data review meetings and strongly suggest that items with poor model fit—poorly fit to the misplaced unidimensional model—be removed. (They do this with item difficulty statistics, too, but that is not as bad for validity claims as this use of model fit statistics from inappropriate models.)

This young psychometrician pushed back, however. She said that psychometrics uses unidimensional models because they fit the data better.

But that’s not true. That’s not true in practice, and that is not true at research conferences. Just the previous day, a colleague of mine told me about a session that he attended—and walked out of. There, a young psychometrician was explaining the use of factor modeling techniques to something something something—I didn’t attend that session, so I do not know what he was trying to do. He showed that item 31 did not fit his model, so he removed it. He did not remove it because it was not well aligned to the assessment target or the larger macro-construct. He never even looked at the actual item itself. Rather, he removed the item simply because it did not fit the psychometric model he was using.

No consideration for the construct’s theorized model. No consideration of the formal domain model. Only the psychometric model.

My colleague was walking by, so I pulled him into the conversation. Yu agreed that this happens sometimes, despite what the brilliant young psychometrician had been taught and expected to see.

My colleague and I know that this happens quite a bit. Psychometricians come with their opaque techniques and intimidatingly precise numbers. Few people outside of psychometrics have the confidence to push back against people armed with something they do not understand, and that precision—and all of its decimal places—is so easily mistaken for accuracy.

The platinum standard is powerful. It shapes our assessments, and not for the better. It leads to the removal of possibly the best aligned items, simply because they do not accord with the demands of psychometrically simpler tools and their rather unrealistic assumptions. The platinum standard forces those inappropriate assumptions on the entire field, requiring those who actually focus on the domain model and content alignment to simply accept the demands of psychometrics.

Content Development Professionals Require Different Expertise than Teachers

Because assessment is an important part of teaching, it is not surprising that content development professionals (CDPs) for large scale assessment (e.g., standardized testing)—the professionals who develop and refine the contents of tests—require many of the same skills, knowledge and expertise as teachers. However, CDPs also need other skills and knowledge to work at a high level.

Unlike teachers, CDPs do not have to worry about classroom management or lesson plan construction. But they do need a teacher’s understanding of the content area and of how to think about content and learning paths. Of course, both teachers and CDPs need to understand the cognition of others without unwittingly projecting their own thinking onto them.

However, CDPs need to think about these things a bit differently than teachers. Large scale assessment does not have as many opportunities to assess students as classroom practice, so it must do so much more efficiently. Teachers can triangulate lots of different information from and about students to figure out whether they understand something, but large scale assessment usually depends on a single assessment to make that inference. Therefore, CDPs need a much more precise view of evidence than teachers do. They need to be able to recognize the ambiguity of evidence so that they can create test items that elicit more definitive evidence of a test taker’s level of proficiency.

While teachers often focus on how to integrate the knowledge, skills and abilities found in various learning standards into larger lessons and activities, CDPs need to understand how to isolate them while still preserving some amount of authenticity. They need to be particularly mindful of the kinds of mistakes that learners make and how they relate to particular learning goals—recognizing their connection to the targeted cognition of an item.  

Like teachers, CDPs need to understand how others think—often others very different than themselves. Teachers have their students in front of them, and can learn more about them over time. CDPs have to imagine test takers, rather than being reminded of them every day by their presence. Moreover, the range of diversity and dimensions of diversity are vastly larger with large scale assessment than a single teacher in a single school must account for. We call this radical empathy because of the amount of variation in background, experience and perspective that CDPs must be able to anticipate. 

CDPs also require technical knowledge and skills that teachers do not. CDPs need to know the differences between different item types, how they work, and which are most appropriate to elicit evidence of different sorts of cognition. They need to be able to recognize problems in a multiple choice item and to make it better elicit evidence of the targeted cognition for the range of typical test takers. It is incredibly difficult to create high quality multiple choice items that produce high quality evidence, which makes it all the more important to take those challenges seriously.

They need to understand the workflows, contributors and collaborations that comprise the test development process. Moreover, they need the ability to push back against the various pressures to alter items in ways that compromise their ability to elicit evidence of that targeted cognition for the range of typical test takers—or even to omit them from an assessment entirely. Of course, all of this requires understanding the values and thinking of the many different disciplines that contribute to these collaborations.

I would never suggest that CDP work is more difficult or complex than teachers’ work; clearly it is not. Working with children—of any age—and being sensitive to their needs is enormously challenging and complex work, made more so by the official and unofficial learning goals. All of those challenges are magnified exponentially by the reality of how many children are there at the same time.

However, the work of developing the contents of standardized tests is itself complex and difficult, mostly in ways unappreciated by the public—and even by others involved in large scale assessment. It leans on many areas of skill and knowledge that overlap with teachers’ expertise, but it has different goals and constraints. Therefore, it also requires different expertise—including, but not limited to—expertise in the content area.

Content Development Work Requires Far More than Just Content Expertise

There is a cynical and incredibly foolish expression: “Those who can, do; those who can’t, teach.” Yes, it is grounded in the idea that many teachers are not the deepest content experts. However, it is foolish for two reasons. First, it entirely misses the fact that good teaching requires its own set of skills—skills that mere content experts usually lack. Second, and perhaps less obviously, it misses the fact that teachers must have a different relationship to the content than mere practitioners—even those at the highest levels of expertise, who practice with the most nuance, skill and wisdom.

Teaching requires thinking about the content, holding it at arm’s length, rather than just using it. Some (e.g., teachers) call this meta-cognition: thinking about thinking. Being able to do does not require consciously understanding what it is you are doing, or being able to communicate it to others. In fact, that kind of thinking can get in the way of fluid, skillful practice. Nor does doing require understanding how others might do the skill. Teachers, however, have to understand the kinds of mistakes that learners make, and the different learning paths towards proficiency.

Content development work for large scale assessment (e.g., standardized testing)—the work of crafting and refining the contents of tests—requires many of the same skills as teaching. It requires thinking about the content. Like teachers, content development professionals (CDPs) need to understand how others understand the content, and the ways in which they might misunderstand or misapply it. They need to be able to recognize their own thinking and learning path, but not be so self-centered as to assume that it is the only learning path. Like good teachers, they need the empathy to imagine the cognitive paths of others—including those with vastly different backgrounds and experiences. 

Yes, and like teachers, CDPs need content expertise. And like teachers, they need far deeper content expertise than most people realize. They need to understand how the content fits together and how it is applied in practice. They rarely have the fluidity of a practitioner’s mastery at the highest level of professional practice, but they have deep understanding of content, nonetheless. 

And they also need many of the same areas of expertise as teachers, in addition to their own particular skills, knowledge and abilities.

Has IRT Destroyed Old School Validity?

When I first learned about the measurement idea of validity, I was taught that it is about whether the measure actually measures the construct. I was taught that validity pairs with reliability, which is about how consistent the measure is. Reliability is like the margin of error from one measurement to the next, but validity is whether you’re aiming at the right target. I have had this definition in my head for…decades? I think I first learned it in a psychology class in the 1980s.

When I came to the field of educational measurement this century, I found a somewhat different definition. Since the 1999 Standards for Educational and Psychological Testing (AERA, APA, NCME), validity in educational measurement is about whether there is evidence to support the use of a test or measurement for a particular purpose. We all stress that validity is no longer a property of the test itself, but rather a property of the particular test use. And there are different kinds of evidence that can contribute to this idea of test validity.

That shift to attention on particular test uses is really important. When tests are repurposed, they might no longer be appropriate. For example, a very good test of 4th grade mathematics is simply not a good test of 6th grade mathematics. It is not that the test has changed, but rather that its use has changed. So, the old validity evidence for the old use is no longer germane. 

I buy that. I really do. But I still have in my head the issue of the basic inference. That is, totally apart from test use, does the test actually measure what it claims to measure? Are the inferences we make about test takers…valid? I think that that still matters.

In fact, I think that whether the tests measure what they are supposed to measure is the real point. I think that that old school idea of validity as simply the question of whether the test measures what it is supposed to measure is critically important. If it does, then appropriate uses are kinda obvious. And inappropriate uses are also kinda obvious.

So why the shift from the 1985 Standards to the 1999 Standards?

I have a theory that is probably incorrect. But it’s in my head.

For decades, the statistics behind big standardized tests have been based on something called IRT (item response theory), and before that on CTT (classical test theory). Each of these generally reports a single score that is useful for sorting and ranking test takers. No matter how many different elements the test is supposed to measure—like the different standards in a domain model—they each report a single unified score. However, for them to work reliably, test developers remove potential test items that seem like they might be measuring something a little different than the other items. So, the better each item does at measuring its targeted standard, the less likely that item is to be included. The more an item instead kinda measures some muddled middle idea of the construct, the more likely it is to be selected. Psychometricians call that “model fit,” and the model is usually unidimensional IRT or CTT.
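To make “unidimensional” concrete, here is a sketch of one common IRT model, the two-parameter logistic (2PL)—an illustration of the general form, not any particular testing program’s implementation. Every prediction about every item flows from a single latent ability number; the model has nowhere to put a second dimension.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a test taker with ability theta answers
    correctly an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The single theta drives everything. Two items with the same a and b are
# interchangeable to the model, no matter which standard each one targets.
for theta in (-1.0, 0.0, 1.0):
    print(f"theta={theta:+.1f}  P(correct)={p_correct(theta, a=1.2, b=0.0):.3f}")
```

An item whose responses depend on something that single theta does not capture—a distinct standard, or reading load on a math test—shows poor fit to this model, and it is that misfit statistic, not the item’s content alignment, that flags it for removal.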

When there is a conflict between a multi-dimensional domain model (e.g., the different knowledge, skills and abilities that go into a set of standards) and a unidimensional psychometric model, modern educational measurement favors the unidimensional model—throwing aside items that better fit the domain model than the psychometric model.

As a content person, I have never been able to figure out what that mushy middle means. On a 4th grade math test, it’s some vague idea of math ability…but it’s not clear which aspects of math ability factor in and which do not. It might include the ability to learn math. But how much? It might include computational accuracy. But how much? It might include problem solving ability. But how much? Or even reading ability! Because model fit statistics lead to the removal of really strongly targeted items (i.e., as opposed to items that lean towards the muddled middle), I don’t think we could ever know.

These techniques produce a seemingly definitive ranking of test takers with seemingly definitive quantitative scores—often to many decimal places. But it is never clear what they are ranked on. Something about math…but what? They most definitely are not a thoughtfully weighted composite score when IRT is combined with item selection and model fit statistics.

Which takes me back to the question of old school validity vs. new school educational measurement test validity. Was the change necessary simply because we never know what IRT is scoring students on, from a content perspective? That is, IRT results are not interpretable through the lens of the construct, so we no longer focus on the inference?

That’s what I am thinking about, these days.

Are we measuring the right construct?


Imagine that you are in a kitchen and need to measure the volume of some odd solid object, or the difference in volume between two odd solid objects. But the only real measuring tools are scales (i.e., a kitchen scale and a bathroom scale) and any number of household tape measures, rulers and yard/meter sticks. And the internet is down.

* One approach might be to simply take the mass of the object(s), figure that most things have a density of around 1 g/cm³, and go with that. If you need the difference, take the difference.

* Another approach might be to do that Archimedes thing and try displacement. Fill up a cup or larger container to the brim with water, drop the object in the cup and catch all the water that the new object forces out of the cup. That would take a saucer (or serving platter) under the vessel to catch the water. Measure the mass of that saucer (or serving platter) empty and then with the water. Eureka! The difference is the volume, so long as you convert the units, right? So clever, that Archimedes.

* The third, and hardest, approach would be very much like the second, but it departs from the Archimedes version because these objects are not gold crowns. You’d need to push the object down into the water, making sure that it is entirely submerged—but without putting anything else in the water. Either push it JUST under the surface, or use some very, very fine tools to hold it down further. Again, calculate the mass of the displaced water and convert the units. That’s the volume of the object; just subtract the smaller volume from the greater if comparing two objects.

The third approach is way more clever than the first two, and is the only one that will always actually give you volume. The first approach approximates volume, but will not work for objects that easily float or sink—signaling a density significantly different from water’s—because it really just gives you mass. The second approach will work for denser objects, which submerge entirely in the water, but not for objects that float (i.e., are not entirely submerged). For the former, yes, volume. But for the latter, it just gives mass again: a floating object displaces its own mass of water, not its own volume. Not actually as clever as we thought.
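The arithmetic behind the three approaches is just unit conversion around water’s density. A quick sketch, with all the numbers invented for illustration:

```python
WATER_DENSITY = 1.0  # g/cm^3, close enough near room temperature

def volume_from_density_guess(object_mass_g: float, assumed_density: float = 1.0) -> float:
    """Approach 1: guess a density and divide. Only as good as the guess."""
    return object_mass_g / assumed_density

def volume_from_submerged_displacement(water_mass_g: float) -> float:
    """Approaches 2 and 3, fully submerged object: the displaced water's
    volume equals the object's volume, so convert caught grams to cm^3."""
    return water_mass_g / WATER_DENSITY

def mass_from_floating_displacement(water_mass_g: float) -> float:
    """Approach 2 with a floating object: by Archimedes' principle it
    displaces its own MASS of water, so you learn mass, not volume."""
    return water_mass_g

# A submerged object that pushes out 240 g of water has a volume of 240 cm^3;
# a floating object that pushes out 150 g of water just weighs 150 g.
print(volume_from_submerged_displacement(240.0))
print(mass_from_floating_displacement(150.0))
```

The point of the sketch is that all three approaches report a number in the end; only by tracking what each number actually measures do you know whether you got volume or just mass back.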

(Archimedes’s experiment was a bit different, and he had a whole bunch of spare gold lying around. Neither you nor I have that available for our work.)

I have no doubt that there are many people who think that psychometrics is analogous to the third approach. That it really is clever enough to take the products of limited tools and measure difficult constructs. But what I see is a dependence on limited tools that simply measure something different than the intended construct. And, no, the analysis is not so clever as to successfully convert the results to the intended construct. Disturbingly, it is not that adequate tools are unavailable; rather, it’s the insistence on using unidimensional psychometric models and filters to measure multi-dimensional constructs. There are other models; they just are not favored. Perhaps they are not as easy to use. Perhaps they don’t have an established place in curricula and/or practice. Perhaps it is simply that if we’ve always tended to use a hammer, we tend to redefine problems into ones that can be solved with a hammer.

But the charge of testing is to measure the intended construct, not some other construct that our favored tools are better at measuring.