What Does Unidimensionality Feel Like?

[This is the year of addressing unidimensionality. Here is this month’s installment.]

Unidimensionality can feel good. It is a simplifying assumption that can make a complex set of data or concepts far easier to digest and make sense of. 

An inevitable part of becoming expert in anything is the realization that things are more complex than one had previously realized. Potters think about the many qualities of the clay they work with, and they understand that the question of overall quality is really context- and goal-specific. That is, it is never really about quality, but rather about qualities. The same is true for professional chefs and their knives, because different knives offer different balances of qualities. This is true for inputs and true for outputs. It is certainly true for the subjects of educational assessment. The more expert you are, the more dimensions you see and take into account.

But not everyone has the expertise to recognize all those dimensions. Perhaps more importantly, not everyone has the expertise to process and consider what all of those dimensions mean in the context of each other. It is simply information overload—again and again and again.

Most of us have some area in which we are experts or real connoisseurs. There is something that we care enough about to have developed the ability to comfortably take in and make sense of a large amount of information. We understand what it means and have the schemas to process it together for our various purposes. But this contextual expertise does not make it so easy or comfortable to take in complex information of other sorts.

And so, we resort to simplifying assumptions when working outside of our own areas of expertise. In part, this saves us time. In part, it saves us aggravation and frustration. But mostly, it enables us to make some sense of the complexity, as opposed to simply being overwhelmed or paralyzed.

So, what some people see as a ridiculous oversimplification, others see as a necessary simplification. For some, it turns the apparent chaos into something intellectually manageable, and that feels good. Flattening out details, simplifying and reducing complexity are all coping strategies for the overwhelmed, and therefore they feel good—even necessary.

Well, that’s one perspective. 

To experts, to people who have the schemas and experience to have a grip on the complexity of the many factors and various dimensions of a situation, unidimensionality is frustrating in a very different way. It is not merely a simplification, but rather the greatest oversimplification possible—reducing everything to just one dimension. It looks like willful ignorance. It can feel like an attack on one’s values and expertise. It is the frustration of knowing that an approach is usually going to produce wrong answers, and just get lucky every now and again.

To some, it offers the relief of being able to produce any answers at all, and to others it offers the frustration of knowing the answers it offers will usually miss the point.

To an educator or parent, it is important to know which things a student is good or bad at, and perhaps how good or bad. Companies do not hire people based on GPAs (i.e., grade point averages) or WAR (i.e., Wins Above Replacement), as they care which knowledge, skills and abilities job candidates have. Doctors do not make treatment decisions based on one simplified overall health score. No one whom we trust to make important decisions for us or our loved ones does so based on one unidimensional overall scale—and when we ask them for advice or to explain, we do not want to hear “Well, because the overall score of everything is [x], you should do [blah blah blah].” Rather, we want to understand more than that, and we want the decision to be based on greater understanding than that.

So, what does unidimensionality feel like? Well, at first and to non-experts, it feels good. It feels like the solution to frustration. But to experts or to those invested in the quality of a decision or outcome, it feels even more deeply frustrating.

What Do AI-Generated Items Measure?

The absolute most important question about any test result is what it actually means. The first sentence of the first chapter of The Standards for Educational and Psychological Testing points to "the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests,” and calls this validity.

To understand what a test result means, we have to understand what it measures in the aggregate—which means we have to understand what it means at the item level. There is no magic that can make a whole test measure something that the individual items do not. If you do not know what the individual items measure, you cannot figure out what their overlap measures—and the non-overlap only creates huge errors.

This is the question of item alignment. What do the items—the building blocks of any assessment—actually measure? Do they actually measure what they are supposed to measure? How do we figure that out? What are the common pitfalls and mistakes that can undermine such investigations?

The last couple of years have seen a huge increase in interest in AI-generated items, sometimes with a human-in-the-loop and sometimes not. We’ve read papers and seen presentations, but the evaluation of what these items actually measure has been…disappointing. We’ve seen the same mistakes that novice content development professionals learn not to make repeated as though they are standard practice. For example, many AI researchers in educational measurement only evaluate the stem of a multiple-choice question, without considering the answer options or the cognitive paths that might lead to an incorrect answer. Again and again, researchers who do not understand how potential test takers learn particular material or the mistakes they actually make offer their less-than-expert opinions on the KSAs that an item requires.

When challenged on this, they told me that they couldn’t find anything in the literature on item alignment. So, I spent a very frustrating few weeks going through the educational measurement literature and texts to see what they had to offer on this question. And they were right. There is quite a bit on blueprint-, test- or form-alignment. There are some dimensions of what might be considered (e.g., Webb) when rolling up item alignment decisions into test alignment determinations, but nothing on how to make those item-level judgments. There simply is not a literature on item alignment.

But AI-generated items are useless if they do not actually measure what they are supposed to measure. Bad building blocks cannot fulfill the requirements of test blueprints and can produce indecipherable test results. Worse, they could produce fraudulent test results that simply do not report on what they claim to report, and suggest inferences for which there simply is not sufficient evidence or theory to support them.

So, here is a review of item alignment. Here are the basic considerations of how to determine whether an item is aligned to its alignment reference—be it a standard, an assessment target or something else. If we are going to evaluate the potential of AI-generated items, we really need to be rigorous in our evaluation of the products they provide—the items!

Item Alignment: Understanding the Quality of the Evidence that Items Elicit

Alignment—the mapping between test items and their intended constructs—is central to test validation but remains understudied at the micro level of individual items. This paper examines how judgments about item alignment are made in practice, analyzing five common misconceptions: ignoring item modality, ignoring alternative cognitive paths, ignoring additional KSAs, lack of deep expertise with the domain model, and failing to consider the diverse range of test takers. We frame these issues using Type I (false positive) and Type II (false negative) errors in inferences about test-taker proficiency at the micro-level of individual alignment references (e.g., standards). We further explore the nature and impact of different sources of additional KSAs. The paper further examines challenges in alignment within a standard, including difficulty, learning pathways, components of complex standards, and text complexity. Despite the importance of targeting the core rather than margins of standards, numerous factors incentivize alignment with the less important margins of a standard, including ease of item development, psychometric pressures, and naïve misreadings of standards by non-experts. We argue that improved alignment requires recognizing the distinct requirements of large-scale standardized assessment and bridging disciplinary training gaps between psychometric perspectives and content development expertise to improve the quality of evidence elicited by test items.

What is Unidimensionality?

What does it mean to be good at math? There are students who were good at math before they hit algebra, and then struggled. There are students who were good at math, but just weren’t great at the proofs of geometry class. There are kids who were good at math until they hit calculus. There are kids who are good at math, but just can’t do word problems. There are kids who are good at math, but keep making sloppy mistakes. There are kids who are good at math so long as they have already learned how to solve that kind of problem, but particularly struggle when faced with novel problems.

So, what does it mean to be good at math?

Can a student be really good at math if they struggle with algebra, proofs, calculus, word problems and novel problems, and make sloppy arithmetic mistakes? Clearly not. These things are all aspects of mathematics. The best math students excel at all of them and the worst excel at none. But most students are better at some and worse at others.

When we have large constructs (e.g., math), but students differ in which parts they are better at and which parts they are worse at, the construct is multi-dimensional. Math is not one thing; it is not unidimensional.

English Language Arts is not just one thing, either. Reading is not just one thing, and neither is writing. One can be a good speller, but have poor command of the conventions of formal grammar. One can write good sentences, but struggle with developing a single cohesive paragraph. One can struggle to put together a cohesive piece that organizes ideas and support for them. And quite differently, one can write imaginatively—a certain kind of creativity. One might be good at writing evocative descriptions, or real-seeming characters. One might imagine interesting plots, or write realistic dialogue. Reading also has many components that different readers are better or worse at.

Not only can people differ in which dimensions of a larger construct they are good at, the kinds of lessons and practice that might help them improve often differ from dimension to dimension even within a construct. Learning to be a better speller is a very different process than learning to write real-seeming characters. Learning to be more careful with arithmetic is different than learning to solve word problems.

The thing is, it’s not just that mathematics is multi-dimensional. Even arithmetic is itself multi-dimensional. Even multiplication is multi-dimensional. Even single-digit multiplication is multi-dimensional. When someone learns their multiplication tables, they can be better at some parts of it than others. 2’s, 5’s and 10’s are easy. The others…well, there are tricks and there is memorization. If we all focused on 8’s first, we might know them better than 6’s, but we tend to focus on 6’s before 8’s. Eventually, however, when we are past those learning stages, we process all of that complexity more automatically and the dimensionality of multiplication tables reduces. It might even become unidimensional, differing only in our level of command, with the individual differences that we had when we were first learning them. Some people know them all, and the ones who don’t tend to make the same mistakes. That is, once it is safe to assume that we have obtained the level of proficiency with single-digit arithmetic that we are going to obtain, it is unidimensional—but that is past the point when it is a skill worth measuring.

So, some people remain better at algebra, while others might remain better at the reasoning skills of proofs, and others better at the diligent care of avoiding sloppy mistakes. Similarly, some writers are better at dialogue, others at character and others at plot. Moreover, science, social studies, foreign language, psychology, each sport and most everything else is actually multi-dimensional.

Even sprinting—running a footrace—is multi-dimensional. Track and field coaches talk about the biomechanics of i) the start, ii) acceleration, iii) drive and iv) deceleration—though some think there are more and some think there are fewer dimensions. Thinking through this example gives the lie to the idea that unidimensionality can be meaningfully built of a constant combination of separate components. Different sprint distances (e.g., 10m, 40m, 100m, 200m) each constitute a different ratio of these different components, and there is no absolute or definitive reference for which ratio represents sprinting. It is always an arbitrary decision which one to favor.

So, from an educational measurement perspective, what is unidimensionality? If we care at all about the substance and what we are measuring, then unidimensionality is an arbitrary fiction created to serve some convenience—and perhaps never even able to serve that convenience well.

The Importance of Accessibility in Assessment Development

The education sector generally and assessment specifically should understand why accessibility is important. The Individuals with Disabilities Education Act ensures that students with disabilities are able to access appropriate educational opportunities. Psychometrics talks about construct-irrelevant variance in reference to how things we are not trying to assess might impact test results. Psychometrics calls those things “irrelevant.”

Employment law also addresses this topic. One of the foundational ideas of the Americans with Disabilities Act is that a disability should not disqualify someone from taking part in life or having a job. For example, so long as candidates can perform the essential functions of a job, employers are required to provide reasonable accommodations so that they can do the work.

Of course, there’s that old universal design idea that making things easier for people with disabilities makes them easier for everyone. It really can be win-win. Famously, curb cuts (i.e., the ramps now built into sidewalks at street corners, cutting through the regular curb) that were originally intended to help people in wheelchairs turn out to help many, many others. People with wheeled suitcases. People with rolling carts. People wearing high heels. People carrying bulky loads that make it hard to look down. Anyone with sore or stiff knees, such as those with injuries or just the wear and tear of age. This is such a clear example of accessibility enhancement that the whole idea is called the curb cut effect.

My own first real exposure to assistive technology was in the early 1990’s. The family of a friend of mine was involved in an early version of voice dictation software. This was before Windows, back in the DOS world. My friend asked me to help her at an assistive technology trade show, and I demonstrated this amazing program, Dragon Dictate. I could speak (not quite continuously) and it would type my words. I could use a whole DOS computer with it. Though expensive, this technology could enable people to work jobs they might not otherwise be able to. They could make economic contributions, and support themselves in the process.

All of that is about accessibility. But all of that is the moral case, not the business case. That is about why it is good to help other people who might need just a little assistance. Right now, it seems that some look down on such a value.

This series on DEIA (i.e., diversity, equity, inclusion and accessibility) is about why it is good for the assessment industry and our products. So, why is accessibility in our own practices good for our products?

Well, as much as the RTD Project talks about the importance of empathy and the practice of radical empathy, they are not easy. Truly understanding someone else’s perspective, understanding how their experiences give them different views and understandings than we ourselves have, is hard. It is work. It takes time, information and even instruction. We can try to imagine, but there is nothing like asking others to help us to understand something, and listening to experts who know more than we do. In assessment development work, we simply must understand the different perspectives of our test takers if we are going to be able to develop instruments that assess at all accurately.

The law requires us to test all students—or virtually all students. Professional licensure exams must be available to all potential test takers—virtually regardless of disabilities. How can we develop assessments with valid uses and purposes if they do not work for such a significant portion of the test-taking population? If we mis-measure the proficiencies of the disabled, how can tests that have any component of norm-based scoring or reporting—as so often enters the standard setting process, even when we try to keep it out—deliver accurate results for any test takers?

So, we need room on our teams and in our own organizations for the disabled. People whose own lived experiences face different constraints than mine will notice things that I might not. Test delivery platforms might operate differently for them in ways I do not notice. Contexts for math problems might have assumptions that I take for granted, but others do not understand. And the perspective-expanding conversations and lessons I get from learning from colleagues with some disabilities can make it easier for me to understand or imagine the perspectives of people with other disabilities. If I let them, they can get me out of my own perspective into a more open-minded space of empathy. They can help me to better ensure that the instrument in front of me is more focused on what it is supposed to assess.

Disability is just another dimension of diversity—or a bunch of different dimensions. The relatively minor efforts to make our workplaces and workflows accessible to people with disabilities enhance the effectiveness of our teams and our products just like the inclusion of other dimensions of diversity does. Perhaps it only gets its own letter in DEIA because some people limit the dimensions of diversity they consider too narrowly.

Democracy and Education Research

Whether you believe in market-based approaches to improving our schools or more traditional approaches, it is vital that the public know about the functioning and effectiveness of our schools. Markets require informed consumers, and democracy requires informed voters. Neither system of accountability can function effectively—let alone efficiently—without information. 

This is why I work in large scale assessment. I believe that our public schools are the most important service that our governments provide. A vast majority of our children, of our citizens, go through them. Our schools prepare the next generation for citizenship, for economic participation and to be members of our communities. The moral legitimacy of our public schools comes from the same place as the moral legitimacy of all of our governments’ actions: the will of the governed. I believe in school board elections because our schools are so important that our communities should vote on them on their own, rather than as part of the larger bundle we consider when voting for mayors, city council members, governors, legislators and presidents.

Frankly, we all need better information about the functioning of our schools, because we all pay taxes to support them. Our property values and rents are influenced by perceptions of the quality of our schools. And the future of our communities and our children are strongly shaped by them.

Obviously, schools are not the only influence on these things. In fact, our children’s futures are shaped more by non-school factors than in-school factors. But other than family, schools might be the most important factor. (When churches influence children, they primarily do so through the behavior and teachings of parents, who are so credible and important to children.) Schools are vital institutions in our communities, second only to families in how we shape our communities for the future.

We need more information about our schools, not less. And certainly we need more than quantitative statistics and test scores. But those quantitative statistics and test scores can be useful. They are hard to do well—thus, my work—but easy to consume. They are quick information for people who might not have the time, patience or interest to delve more deeply into rich qualitative findings about schools. No doubt, we need qualitative and quantitative reporting on more than just core academic lessons, as we want schools to do more than just teach those core academic lessons. Character, citizenship, mental and emotional health, resiliency, collaboration skills, emotional intelligence and more. But certainly, we all want to know how our schools are doing with those academic lessons that are at the center of so much that schools do. We need better tests and better reporting, on a richer array of educational outcomes.

To abandon public reporting on our schools is, in my view, to abandon any investment in improving our schools. It undermines the basic engines of school improvement, be they grounded in democratic oversight or in market mechanisms. I know of no moral call greater than trying to do better by today’s children and even better still by tomorrow’s children. This calls for investments to learn how we can do better by our own, our community’s and our nation’s children.

No one voted for abandoning such efforts. We already spend so little on education research, making our research efforts so much more difficult than they should be. Cutting education research funding is a statement that no one’s children are worth investing in or improving for, not even our own. I can think of no more immoral view than that. I have all kinds of criticisms about what NCES and IES do and the research they fund, but those are primarily in terms of the important types of research that go unfunded, rather than taking issue with the importance of the research that they do fund. Cutting this research is giving up on the most fundamental infrastructure of our society.

If an unfriendly foreign power had attempted to impose this on America, we might well have viewed it as an act of war.

(With apologies to John Dewey.)

Inclusion in Assessment Development: Making Use of Diversity

It should not be hard to understand the meaning of inclusion in assessment development, as so many of us have been classroom educators. 

For classroom educators, inclusion means including special education students in regular classrooms, lessons and activities—rather than keeping them in the building, but in self-contained classrooms. It is about including those students where the main action is, rather than marginalizing them over there in some other part of the building.

This same logic applies in the workplace. It is not enough to merely include diversity in the organization if it is marginalized over there. It’s not enough that it is listed on paper as being part of the team, but not in the room where issues are discussed. It is not enough if it is not at the table where decisions are made. 

Inclusion is about actually taking advantage of the potential of diversity on our teams to help our projects and our products. 

Obviously, this matters quite a bit when it comes to writing assessments for the diverse range of test takers who take our products. If our diverse voices cannot be heard appropriately, then the promise of enlisting them in the first place goes unmet. I would suggest that disciplinary lens is another dimension of diversity that should be acknowledged and considered in the context of inclusion. Discussions of issues and decisions need to be open to those diverse voices, or else their knowledge and potential contributions will be wasted, and our products will suffer.

I suppose that this is an aspect of balancing confidence and humility, of knowing when to listen—which requires ensuring that those other voices are present for discussion and decisions. If we did not have a history of marginalizing some voices and excluding some perspectives from the room and table, this would not be notable. But we do have those histories, so we need to be careful to break those old patterns and establish new norms for how we ensure that our product (and decisions) are able to leverage the potential of the diversity within our teams. 

Because of longstanding norms of power and who is centered, this requires intentional effort and attention to ensure that those voices are truly included. Because this is about cultural norms and power—yes, it really is about power—efforts to truly include those voices and perspectives take more work and more difficult work than those who have always been included realize. It takes more than merely literally including people in the room and having them at the table. It takes the work of giving them the confidence to speak up and the work of giving the others the humility to listen. 

But it is all worth doing because it produces better products that have a better chance of being put to some valid use and/or purpose.

Is It Time to Just Ignore NAEP?

Should we be paying this much attention to NAEP? I don’t think so.

Differing Standards

Are you an expert in anything? What do you think the important knowledge and skills in that topic are? Could you make a list of them—an organized and detailed enough list to guide years of instruction on that topic?

Let’s imagine cooking. Here are some questions you’d need to figure out:

How important is baking? How much might you want to focus on the skills and knowledge of baking breads? Cookies?

Roasting? (What is the difference between roasting and baking, anyway?) What are all the important skills of roasting meat? Roasting vegetables? Roasting pastas—or is that baking?

Grilling? Is that the same as barbecuing? What are the important skills and knowledge there? Still gotta cover gas, charcoal and wood? 

What are the important skills and knowledge around salads? What is a salad, anyway? 

Old school skills: aspic? Jello mold? What about them?

What about principles of healthy cooking? What are those? Are they worth including? In what year? What are the skills and knowledge?

Blooming spices? Is that on your list? Should it be? 

Reusability of parchment paper? How to clean cast iron pans? How to season them? How to clean a blender properly? How to make clear ice cubes?

The thing is, my year by year list of critical knowledge, skills and abilities would be different than my co-author’s, and different from my wife’s. And different than yours. Two really good lists could still differ significantly—even radically.

We used to have more variation across the states when it comes to state learning standards for math and reading. We have less variation today, in large part because of the widespread adoption and adaptation of the Common Core State Standards (CCSS). One would think that that eases the problem.

NAEP is Not Aligned to Common Core

The problem is that the widespread influence of CCSS has not really gotten to NAEP. For NAEP to align to Common Core would break comparability over time. Measure something else, even something only moderately different, and you should not compare the old scores to the new scores. All those longitudinal sequences would break—and longitudinal sequences are a big raison d'être of NAEP.

There is a lot that I really like about NAEP. It is a very well designed, produced and implemented assessment. It’s just testing the wrong thing.

But it does not measure what teachers are told to teach.

Does it generally measure the right things? Sure. Generally. But not exactly the right thing. It’s kinda measuring the wrong thing. Far from totally wrong, but not really the right thing. 

Like my wife’s sister, or maybe her identical twin sister. If my wife and her (fictional, btw) identical twin sister were raised in the same family and took all the same classes, would it be ok to only test her sister and then say that the scores and grades applied to my wife? Would that be accurate? How far off do you think the scores might be?

NAEP is measuring over there, but teachers are told to teach over here. They are close, but they are not the same. So, how much can we trust NAEP to reflect that real state of learning and proficiency of our students?

There is No Good News Here

There is no way to spin the latest NAEP results (math, reading) as good. The downward trends are concerning. But frankly, I have no idea of the extent to which they merely represent divergence between CCSS and NAEP’s own alignment references. None. And I’ve seen few serious efforts to figure that out. The fact that the downward trends predate COVID, but postdate the widespread adoption of CCSS, really concerns me.

But down is down. It is not up. NAEP’s measured constructs are similar enough to CCSS’s that I would hope to see increases in NAEP, even if they are attenuated from the actual learning and proficiencies of students. The problem is that it is quite easy to imagine that more and more refined efforts to address CCSS’s versions of the constructs result in more drift from NAEP’s foci.

Nonetheless, the trends are national in scope. Virtually across the board. That ain’t good. Perhaps it is just a reflection of testing the wrong thing, but there’s no good evidence here, no good story to be told. 

I care most about making sure that tests actually measure what they claim to measure—well, other than caring about the learning, development and health of children, of course. I think there is a broad crisis of standardized tests misrepresenting what they actually mean to an audience that is hungry for particular meanings from those tests. If the tests are not measuring the right thing, the usefulness of the entire endeavor is questionable. The coverage we see of NAEP results does not account for this, which is perhaps evidence that we should not be paying any attention at all to them. If we can’t get it right with NAEP, what hope is there for other assessments?

When state standards were so varied, NAEP offered a common yardstick to judge them all against. But the NAEP team’s conclusions on what to measure turned out different than the National Governors Association’s and Council of Chief State School Officers’ teams’ conclusions about what to teach. So, my biggest concern is that instead of supplementing other assessments with its own special strengths, NAEP is arrogantly sticking to its own construct definitions.

There is a path forward for NAEP. In a country without federal power over standards or curriculum, NAEP should acknowledge the hard work of states and their leadership—and the goals of schools and teachers. Then, we might actually get more value from it.

Diversity in Large Scale Assessment Development

Standardized tests try to assess specific knowledge, skills and/or abilities (KSAs), but quite often cannot do that directly. First, they cannot actually read minds. Second, many of these KSAs need to be observed in some sort of use, as opposed to as purely isolated skills. Math standards specifically call for applying KSAs “in context,” meaning word problems built of little stories. 21st century science assessments often take some scientific phenomenon and describe some real-world—perhaps everyday sort of—scenario that test takers need to analyze to recognize and apply the science KSAs to. ELA reading passages are also set in contexts.

Understanding any of these requires knowledge of context and culture. What kind of language is appropriate to use? What background knowledge do test takers have? What examples are easy to recognize and unpack, and which take more work? 

But we are an incredibly diverse nation, with children growing up in different contexts, and therefore with different background or common knowledge. 

Some know what a mensch is, and some know what collards are.

Some know that gravy is brown and goes on meat and mashed potatoes, and some know that gravy is red and goes on pasta. And some know that it is white and goes on biscuits.

Some know what a cul de sac is, and some know what a (building) super is. 

Some have a Nana, some have a Gram, and some have a Grandma.

Some grow up playing in woods and creeks, and some grow up around turnstiles and transfers.

Some had back yards and others had front stoops. Some know the difference between a porch and a deck.

There are so many dimensions of diversity, rather few of which we actually capture in the official records of demographics. I knew that not all doctors are medical doctors. I had a back yard, behind which were woods. I had a Nana. But I didn’t really know the difference between porch and deck or what a stoop was. 

The most defining characteristic of large scale standardized assessments is that they are given to incredibly diverse ranges of test takers—be they K-12 assessments or professional licensure exams. This is supposed to be true of the sorts of psychological exams that I do not work on, as well.

Thus, because the testing population is so diverse, it is absolutely vital that those who create and evaluate tests understand the range of perspectives and experiences among test takers. Perhaps not individually—as that is an awful lot for any one person to know—but at least as a team. It is important that individual test developers and evaluators continue to broaden and deepen their understandings of the test taking population, and therefore that they have people to learn from. That is, the work simply requires diverse teams so that the even more diverse testing population can be anticipated.

Otherwise, we can only develop tests that assume the kinds of knowledge and understandings that we ourselves had at those points in our lives, without appreciating how that construct irrelevant knowledge and understanding acts as a barrier to other sorts of test takers’ ability to demonstrate what they can and cannot do, what they do and do not know.

Setting aside any moral obligations to employees or potential employees—really, just set that aside—we cannot develop effective assessments without diverse teams. We need test developers with diverse backgrounds themselves, and experience working with an even broader range of test takers. The lack of diversity among test developers has long been a weakness that undermines the validity of any use or purpose for our assessment products. We need to do better, not retrench into the worst habits of the past.

The Supremacy Clause and Executive Orders

The United States Constitution. Federal law. Federal regulations. Executive orders. Individual discretion. State and local authorities of various sorts. In our system of the rule of law and federalism, there is a clear hierarchy among them.

I. The United States Constitution

The US Constitution (1789) is supreme. No law, regulation or government action is allowed to defy the dictates of the US Constitution. This is why unconstitutional is a death sentence for any government action. On the other hand, the Declaration of Independence (1776) has no authority. It was a piece of rhetoric, an announcement and justification of rebellion (e.g., “He has plundered our seas, ravaged our Coasts, burnt our towns, and destroyed the lives of our people”). It was our nation’s founding document, but it was not our government’s founding document. There were the old Articles of Confederation, and those were replaced by the US Constitution. The Constitution changed the nature of our government—state and federal. It is the ultimate authority.

Amendments to the Constitution are part of the Constitution. They change the contents and meaning of the Constitution. They have higher standing within the Constitution than the original text, as they come later. Though difficult to do, anything in the Constitution can be changed or overridden by constitutional amendment—but not by any other mechanism or authority.

Military and civilian officers in our government swear an oath to uphold our constitution. The US Constitution is the basis for our whole government.

II. Federal Laws

Though they may not violate or override the Constitution, federal laws—passed by Congress and either signed by the president or passed through Congressional override of a veto—are superior to everything else. In fact, this is stated explicitly in the Constitution, in the Supremacy Clause (Article VI, Clause 2).

This Constitution, and the Laws of the United States which shall be made in Pursuance thereof; and all Treaties made, or which shall be made, under the Authority of the United States, shall be the supreme Law of the Land; and the Judges in every State shall be bound thereby, any Thing in the Constitution or Laws of any State to the Contrary notwithstanding.

Laws require either the cooperation of both houses of Congress and the President or an overwhelming 2/3 vote of each house of Congress. Treaties begin with the President and then go to the Senate for ratification. Federal laws must overcome our system of checks and balances, but once they do, they are incredibly powerful.

Laws are necessary because the Constitution cannot predict everything or settle things in sufficient detail. For example, the Constitution says we need a system of federal courts and that there will be a Supreme Court. But how many other courts, how many justices, and how will it all work? Well, that was set up in federal laws, and is subject to alteration by federal law. The greater ease of making laws allows the government to respond to the needs of the nation without resorting to the far greater hurdle of constitutional amendment, and to correct previous actions as well.

III. Federal Regulation

Just as the Constitution cannot anticipate and describe everything in sufficient detail to handle our needs, federal laws also are limited. They need further detail and explanation than Congress can provide—and perhaps ought to provide. Thus, we have a system of federal regulation that fills out those details. For example, what counts as pollution and what are the acceptable levels?

There is a process laid out by Congress for how federal regulations are made, in the Administrative Procedure Act (APA) of 1946. Thus, the process for the executive branch under the President to issue federal regulations was crafted by Congress. Again, checks and balances.

The APA requires delay in crafting regulations. Drafts must be published publicly, and time given for the public to comment on them. There also are procedures for rescinding longstanding regulations, and enacting and rescinding federal regulations can take over a year. However, new regulations can be overturned by either the President or Congress. Checks and balances, and making changes takes cooperation between the executive and the legislature—at least tacitly.

IV. Executive Action and Executive Orders

The executive branch of the federal government comprises virtually every agency, department and office that actually does or enforces anything. Almost any agency, department or office that you have heard of is part of the executive branch, and therefore under the control of the President. In some cases, the President merely appoints their leaders; in most cases, the President can remove them. Heck, in most cases, the President can give them orders—though not all.

But the government is far too big for the President to micromanage every action and decision. Heck, a single store or restaurant is usually too big for its leader to micromanage every action and decision. A small chain adds layers of central leadership over branch leadership. And a larger company or conglomerate adds more layers, still. The President has to rely on layers of leadership and management and line employees to actually get anything done, be it inspecting a meat processing plant, stamping a passport, collecting a tariff or charging a hill. All of those workers at all levels should act in accordance with the Constitution, federal laws and federal regulations, but there is still leeway, discretion and decision-making.

However, discretion is the doorway to discrimination, inconsistency and unpredictability. So, most departments, agencies and offices have internal guidance or guidelines that they publish. This provides transparency to whomever is interested or potentially impacted, and addresses those issues of inconsistency and discrimination. Of course, all of this must be consistent with the Constitution, federal law and regulations.

Sometimes, the President might want to set his own guidance or guidelines. That is what an executive order is. There might be questions of priorities or timing. There is all kinds of wiggle room and various sorts of decisions that are not set in law or regulations. Anyone who has been responsible for implementing a plan knows that there are always a host of little decisions that add up to making a huge difference in how it works out—little things that were not accounted for in the plan. Sometimes, they were merely overlooked, and other times the planners thought it best to leave the decision until the exact particulars were known. That means leaving it in the hands of the folks doing the work at the time, instead of tying their hands in advance.

The more grey areas in the law or regulations, or the more contradictions within or between different laws (or different regulations), the more room there is for executive orders. But they must be consistent with the Constitution, federal law and even federal regulation. That is what it means to be a system of law, to have the rule of law. Individuals—even the President of the United States—are bound by the law. This includes being bound by the Administrative Procedure Act’s rules about creating and rescinding regulations. No president can issue orders with the force of law, or even of federal regulation. No president can override federal law unilaterally, and certainly cannot override the Constitution. Just as the President must spend the money budgeted by Congress on the uses laid out by Congress, the President cannot simply erase laws or create new ones by fiat.

Use of executive orders can be quite controversial. When there is too much to address, the president might set priorities. For example, given the limited capacity of our immigration courts, the President might prioritize the removal of serious criminals, recent immigrants or less recent immigrants. If Congress increased the resources available for our immigration courts, then everyone in our country in violation of our laws could be processed and adjudicated. But without those resources, there must be some decisions made about where to direct them. Such decisions can be controversial. But this is really no different than a local police department deciding where to position officers and which crimes to focus on. For example, which speeding or shoplifting offenses to let go, or how many resources to devote to solving a particular burglary. Many executive orders seem obviously good and righteous to some, even as they seem obviously bad and immoral to others. They are a consequence of democracy and election outcomes. So long as they are consistent with the Constitution and federal law (and even federal regulation), they are within the power of the President.

V. Recent Limitations on Executive Authority

In fact, the discretion of executive branch officials has recently been diminished by the US Supreme Court. Since last year’s Loper Bright Enterprises v. Raimondo ruling, our federal courts have been instructed to give less deference to regulatory and guidance decisions from executive branch officials. The courts have also, in recent years, barred the executive branch from making regulations that are literally consistent with federal law and the Constitution when they judge that an issue is so important that Congress must be actively involved in deciding it—as opposed to delegating it to the executive branch.

That is, more authority for Congress and the Courts, and less for the executive branch.

VI. What About States?

States can layer on top of federal law, but cannot override it. So long as the Constitution allows the federal government to take action or regulate an area, federal law is supreme. Federal law can allow states to add, but does not always do so. There is a federal minimum wage, but states can set higher minimum wages. The FDA approves drugs for sale nationally (and under what sorts of labelling and marketing), but states cannot approve drugs in addition to those approved by the FDA.

Even federal regulations, which are enacted under the auspices of federal law and serve to clarify how the federal law will be enforced and interpreted, are supreme over state law and state constitutions. That is, in areas covered by federal law, federal law and regulation decide whether states can add anything additional, or whether federal law (and/or regulations) simply preempt state efforts.

Of course, to the extent that the executive branch has less authority to issue regulations, guidance or executive orders, that might give states more leeway.

VII. Really?

Well, that is how it worked last week. It is how it has worked for centuries, decades, years and months (as laid out above).

Hopefully, we remain a nation of laws.

The Original Sin of Large Scale Educational Assessment

The Standards for Educational and Psychological Testing explain five “sources of validity evidence,” on pages 13-21.

  • Evidence Based on Test Content

  • Evidence Based on Response Processes

  • Evidence Based on Internal Structure

  • Evidence Based on Relations to Other Variables

  • Evidence for Validity and Consequences of Testing 

Only one of these is really about even moderately sophisticated psychometrics: Evidence Based on Internal Structure. The others are either content based or rely on other sorts of statistical techniques. But evidence based on internal structure gets at some real issues in psychometrics. It is easy to understand, as it has the shortest explanation of the five potential sources of validity evidence. For example, the first of its three paragraphs says:

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity (p. 16).

And yet, the practice of developing, administering and reporting large scale standardized educational assessment seems to have mostly abandoned this form of validity evidence—the only form that really gets at psychometric issues. 

Straightforward examination of domain models (e.g., state learning standards) immediately reveals that these tests are supposed to measure multi-dimensional constructs. Those who know the constructs and content areas best are quite clear that these constructs (i.e., content areas) are multidimensional, with different students doing better in some areas and worse in others. They require an array of different sorts of lessons and ought to be measured with an array of different sorts of questions. 

I was taught that this kind of psychometric analysis is really about factor analysis of some sort. Which items tend to load onto which factors—dimensions—and then qualitative, content-based analysis to confirm that this is as it should be. Heck, the basic question of whether the hypothesized dimensionality of the construct is reflected in the empirical dimensionality of the instrument…well, I was taught that that is really important. And The Standards seems to say that, too.
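As a rough illustration of what that check looks like, consider the eigenvalues of the inter-item correlation matrix: one dominant eigenvalue is consistent with unidimensionality, while several large eigenvalues suggest the instrument is multidimensional. This is a minimal sketch with simulated, made-up response data (two correlated dimensions, five items each), not a real analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated scored responses: 500 test takers x 10 items.
# Items 0-4 tap one dimension, items 5-9 a second, correlated dimension.
n = 500
theta = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n)
ability = np.hstack([np.repeat(theta[:, [0]], 5, axis=1),
                     np.repeat(theta[:, [1]], 5, axis=1)])
difficulty = rng.normal(0, 1, size=10)
prob_correct = 1 / (1 + np.exp(-(ability - difficulty)))
responses = (rng.random((n, 10)) < prob_correct).astype(int)

# Eigenvalues of the inter-item correlation matrix, largest first.
# A steep drop after the first eigenvalue suggests one dimension;
# multiple large eigenvalues suggest a multidimensional construct.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(responses.T))[::-1]
print(np.round(eigenvalues, 2))
print("Eigenvalues > 1 (Kaiser criterion):", int((eigenvalues > 1).sum()))
```

Real dimensionality work on binary items brings its own complications (e.g., tetrachoric rather than Pearson correlations, IRT-based approaches); this sketch only shows the shape of the question being asked.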

But instead of ensuring that the dimensionality of the instrument matches the dimensionality of the domain model, the dominant mode in large scale educational assessment has an almost knee-jerk reliance on unidimensional models. Heck, items that fail to conform to this demand are discarded, as model fit statistics are the ultimate determinant of whether they can be included on a test (form). Such statistics are used to ensure that the dimensionality of the instruments does not match that of the construct. 

This use of such statistics combines with the use of unidimensional models to ensure that tests are not valid, by design. It ensures that domain models will be reread, reinterpreted and selected from only insofar as they can support the psychometric model. The tail wags the dog.

There are many issues with large scale assessment that cause educators, learners, parents and the public to view them as “the enemy,” as Steve Sireci observed in his 2020 NCME Presidential Address. But if I had to pick the single most important one, this would be it. Multiple choice items are problematic, but it quite often is possible to write good multiple choice items that i) reflect the content of the domain model, ii) prompt appropriate response processes, iii) combine for an internal structure that resembles that of the domain model, iv) combine to have appropriate relationships to other variables, and v) support appropriate inferences and consequences. But none of those are possible while insisting that items and tests are not allowed to match the structure of the domain model. This is not simply about ignoring the domain model, as some sort of neglect. Rather, this is active hostility that affirmatively bars using it as the primary reference for test development.

Looking for DIF or other invariances that suggest fairness issues is not enough, so long as the structure of the domain model itself is barred from properly influencing test construction, as The Standards say it should.

To state this more plainly, this practice sets psychometric considerations as the main obstacle to developing valid tests—or tests that can be put to any valid use or purpose.

Difficulty and Rigor

“Difficult” is a problematic word, in the world of teaching, learning and assessment. It refers to many related ideas, but because they all use the same word people often conflate them. (I just learned this year that that is called a jingle fallacy.)

As a former high school teacher, I can think of a whole bunch of different meanings that I might have intended from time to time.

  • Something is difficult if it takes a long time to do.

  • Something is difficult if it is unlikely to be successful.

  • Something is difficult if most people—or at least very many people—are likely to fail at it.

  • Something is difficult if I, as a teacher, have to spend more time teaching it in order for my students to develop proficiency.

  • Something is difficult if it marks such a change from earlier lessons that I need to work hard to teach my students to adopt a different mindset for it than they’ve had before.

  • Something is difficult if it is easily mistaken for something else, and therefore likely to be attempted with the wrong tools.

  • Something is difficult if a person simply does not have any experience with it.

  • Something is difficult if someone has never had to do it themself before.

  • Something is difficult if few people will ever be able to do it.

  • Something is difficult if the path to gaining proficiency is quite long.

  • Something is difficult if precursors are not taken care of.

This incomplete list contains many different ideas, some of which overlap, some of which address radically different aspects of difficulty than others. Some of them might be viewed as contributors to difficulty and some as different conceptions of difficulty.

Large scale assessment (LSA) has a very particular idea of difficulty. In this context, difficulty is measured empirically. It has nothing to do with teaching effort or learning paths. Rather, it is simply the share of test takers who responded to an item correctly. Concepts and lessons do not have difficulty, just individual items. Because seemingly minor alterations to an item can radically alter how many test takers answer successfully, difficulty must be measured through field testing and monitored through operational use of an item.
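In code, this classical notion of empirical difficulty (often called an item’s p-value) is simply a column mean over scored responses. A small sketch with made-up data:

```python
import numpy as np

# Hypothetical scored responses: rows are test takers, columns are items
# (1 = correct, 0 = incorrect).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
])

# Classical "p-value" difficulty: the share of test takers answering each
# item correctly. Higher p means an easier item, lower p a harder one.
p_values = responses.mean(axis=0)
print(p_values)  # [0.8 0.8 0.2 0.6]
```

Note the counterintuitive convention: the third item, with the lowest p-value (0.2), is the most difficult one in this empirical sense.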

In a teaching and learning context, however, this empirical difficulty is not actually a fixed attribute of an item. It is the result of the efforts of teachers and students. When teachers spend more time on some difficult content or lesson—or perhaps come up with a great new way to teach it—students can become more successful learners and then more successful test takers.

Ross Markle explains that educational (and other) interventions can undermine the psychometric properties of tests (and items), including when those interventions are prompted by those tests. An item might be quite difficult (empirically) one year, but because teachers respond to that with new efforts, test takers might be much more successful the next year. Other items like it might be empirically much less difficult in later years, perhaps to the surprise of test developers.

Dan Koretz has long pointed to how teachers respond to test items by altering their teaching to better prepare students for precisely those approaches to assessing content. Alternatively, one reasonable use of LSAs is to evaluate whether curricula support the development of targeted proficiencies. Thus, this kind of feedback loop can undermine teaching and learning, and it can support teaching and learning.

(Of course, all of this violates psychometric assumptions of unidimensionality, but that’s neither here nor there.)

From time to time, there is great talk about needing more rigor in school and especially in assessment. That’s the word that people use, “rigor.” But we think that people really mean difficulty. And we think they mean empirical difficulty. They want harder tests that produce more failing results. They appear to mean that a better test is one that produces lower scores.

We think that that is garbage. We think that a better test is one that better reflects the actual proficiencies of test takers with elements of the domain model—such as the state learning standards for that subject and grade.

And frankly, we do not think that rigor is about empirical difficulty or conceptual difficulty. We do not approve of using “rigor” as a euphemism for “hard.” Rather, we think that rigorousness is something like:

  • extremely thorough, exhaustive, or accurate;

  • strictly applied or adhered to;

  • severely exact or accurate; precise.

Perhaps large scale assessments should be rigorous. They certainly should be accurate. We would prefer that they be exhaustive, but do not see that they are given enough testing time and opportunities to do that. But those seem like reasonable goals.

Their empirical difficulty should be driven by the nature of the content they are supposed to reflect and the teaching and learning of that content. It should not be a goal of the vindictive or be targeted by psychometric considerations that are not deeply based in issues of content, of teaching and of learning.

Certainly, however, test development should be rigorous. Our test development processes should be demanding, and our own professional standards should be strictly adhered to. That is where we would—and do—apply the word “rigor.”

An Economist at Thanksgiving

I had an economist at my Thanksgiving table last week.

That’s not a joke or punchline. That’s just a fact. She doesn’t seem like an economist. She is remarkably charming, warm and personable. So, I usually don’t think of her as being an economist.

What do I mean by that, by “being an economist”? Well, in my experience I have found the field of economics to suffer from the greatest degree of disciplinary arrogance of any field. It appears that they train and acculturate the belief that their toolbox is so outstanding that there is no need to learn—or even respect—the substantive expertise of those in other disciplines. Disciplinary Arrogance.

Now, I am not what I would consider highly expert in school dress code policies. I have been following the issue for 30-40 years. I occasionally read professional research articles on the topic. I follow news links on new developments or examples. But I have not done my own original research or any sort of exhaustive review of the literature—though I sure have read more than one literature review from others over the decades. School dress codes and uniforms are one of those areas of policy and policy implementation that run right into my own beliefs on adolescent identity and social development. So, this topic brings a few professional interests together, even though I do not actually focus my own work on it.

But the economist has a different relationship to this topic. For some second-hand personal reason, she did (or trusted) a little bit of internet research on the topic and was highly confident that she understands the subtleties and complexities of how the law interacted with a particular charter school’s school uniform policy implementation. She thought she understood what the New York Board of Regents had required of public schools.

It was not a public school. It was a charter school, but she did not understand that charter schools are not public schools. (When I pointed out that they are simply privatized provision of traditionally publicly delivered services, she immediately got the point. She’s an economist, so she was already familiar with that dynamic, though she did not recognize it herself.) It was not the Board of Regents, but was instead a new policy from the city’s Department of Education prompted by a New York City Council bill.

Most importantly, she did not understand the substance of the DOE guidelines. She did not know the history of the issues, the problems that the new guidelines were specifically designed to address or the significance of the particular language within the guidelines (e.g., “distracting”). She thought that she understood it all, for having read the guidelines and applied her own knowledge and thinking to the document. She thought that her reading was accurate and her conclusions correct. She insisted on it. She refused to accept that she could have been wrong.

But expertise matters. Which means that intellectual and disciplinary humility matter. While playing in someone else’s backyard—an expression I think I learned from Andrew Gelman—can be fun, it is important to listen and to be mindful of the limits of one’s own expertise. It is important to learn from those with greater expertise, usually by asking questions. This requires acknowledging the expertise of others.

Which, unfortunately, is something that economists generally are not very good at.

The Overarching Goal of Schooling

For my first master’s degree, I had to do a master’s thesis.

I wrote about the importance of a school having a purpose, or what I called “a school-wide working philosophy.” Because I graduated from college certified to teach (i.e., back in the 20th century), this thesis came ten years after I began my teaching career. I had worked in enough schools to have seen some stuff.

I was not particular about the purpose. Rather, I was concerned that when different teachers and different programs in a school are aimed at different purposes, they undermine each other. For example, athletics can undermine academic classes when coaches pressure teachers to raise a student’s grade so that they meet grade requirements for eligibility to play. Traveling debate teams can undermine academic classes, too; when any extra-curricular activity pulls students out of class for a road trip of any sort, students miss lessons for their academic classes.

Test scores and deep content lessons are not generally aligned. Core academic lessons and developing an outstanding college application are not necessarily aligned. Social development is something else, entirely.

Today, decades later, I still believe in program alignment. I still believe that schools should be clear about their priorities and what they are trying to do for students. I still believe that the hodgepodge of different programs with different allegiances among students and among educators is dangerous.

But I think that today I am more accepting of the existence of multiple goals. This makes the alignment problem even more challenging. This makes leaders’ roles even more important, as they try to build and manage a school or system that supports multiple goals without them undermining each other. The decisions are tougher, not easier, this way. And leaders still need to say “No” sometimes.

Nonetheless, I still believe in an ultimate and overarching goal for our schools. I do not believe that schools exist primarily to deliver some benefit(s) to individual students. That is not a good enough reason to require all children to attend or to tax everyone to pay for this enormously expensive endeavor. Rather, schools exist to serve communities.

Therefore, whatever programs or purposes schools might have, they should serve the community. Is students’ social development important? Surely. Preparation for citizenship? What could be more important than that? Preparation for economic contributions? Yes, that is important, too.

This is my lens. I think this is always my lens, the ultimate goal of schools and schooling. The overarching goal. Not merely a justification others try to hang things from, but rather a value with which to judge efforts in and around schooling. How can this school and its program support the communities it serves?

Therefore, schools must be under the control of communities. Strong influence from the local community. Influence from the regional (e.g., state) community. And so long as we are such an interconnected country, influence from the national level, as well. They must be communal affairs, and not perverted into serving private individual interests.

We already have private schools that are designed to serve individual interests. No, they are not good for communities. They are not good for democracy. They are not good for a pluralistic nation, region or community. Therefore, it is even more important that our public schools be preserved and supported. Therefore, it is vital that we protect them from privatizing interests.

The Road from Validity (Part II): Cementing the Replacement of Validity

Last month, I explained how large scale human scoring departs from carefully constructed rubrics and procedures, replacing them with something different. The desire for reliability, and procedures that favor reliable scorers (i.e., those most likely to agree with other raters), serves to push out those who use the valid procedures and favor those who use the faster procedures.

Of course, we no longer live in an age of large scale human scoring of test taker work product. We have replaced most human scoring with automated scoring. There are algorithms and artificial intelligence engines of various sorts that do the scoring, instead. They are trained on examples of human scoring, but the bulk of the work is done by automated scoring.

How do we know whether the automated scoring engines are any good? Well, their creators, backers and proponents proclaim again and again that they do what human scorers do, only better. Not just faster and cheaper, but better.

What does better mean? Well, they use the same idea of reliability to judge the automated engines as they do to judge human scorers. They make sure that the automated scoring engines have high agreement rates with the pool of human scorers on some common pool of essays. That sounds fair, right? They use the same procedures that are used to judge human scorers, and to decide who is invited back, to judge the machines.
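To make that criterion concrete, here is a minimal sketch of the simplest version of it, exact agreement: the fraction of essays on which the engine assigns the same score as the human rater. This is a generic illustration, not any vendor's actual evaluation code, and the scores in the example are made up.

```python
def exact_agreement(engine_scores, human_scores):
    """Fraction of essays on which the engine matches the human score."""
    if len(engine_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    matches = sum(e == h for e, h in zip(engine_scores, human_scores))
    return matches / len(engine_scores)

# Toy data: the engine matches the human rater on 4 of 5 essays.
print(exact_agreement([2, 3, 3, 4, 1], [2, 3, 3, 4, 2]))  # 0.8
```

In practice, chance-corrected statistics such as quadratic weighted kappa are also commonly reported, but the logic is the same either way: the engine is judged by how well it matches the pool of human scorers.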

This means that the same dynamic that replaced validity with reliability—that replaced the valid scoring procedure with the faster scoring procedure—is used to tune the algorithm.

No, automated scoring is not better at using the valid procedure—not faster, cheaper or more diligent. No, it means that automated scoring is better at using the faster procedure.

Of course, once the faster procedure is coded into the algorithm, it is harder to question. This is yet another form of algorithmic bias. Algorithms trained on biased data perpetuate those biases in perpetuity, rather than highlight them. In this case, the algorithm simply repeats the biases that make up the deviations from the valid procedure. Whatever was harder or slower or more nuanced in the valid procedure is left out.

No, machines do not make mistakes. They do what they are told to do. The human scorers who depart from the valid (and slower) scoring procedures are making mistakes. But the machines are simply doing what they are told to do—trying to match the results of the human scorers.

And their developers and proponents brag on that, as though doing what they are told to do is necessarily doing the right thing. They fail to audit their training data (i.e., the results of human scoring) for validity, trusting in their beloved reliability. So, their algorithms cement the biases and departures from validity and hide them behind claims of reliability—as though reliability is the same thing as validity.

Not intentionally. Perhaps not quite inevitably. But quite consistently. Quite reliably.

The Road from Validity (Part I): How Reliability Replaces Validity


Imagine that you have some complex multidimensional trait that you want to test for. 

A. You get a bunch of experts and ask them to rate the recorded performances of test takers…only they don’t agree on scores. They rate the performances inconsistently because they do not agree on how to weight the different contributing factors and dimensions. They are all experts, so they are not wrong, but you judge the inconsistency a problem. 

B. So, you train them. You provide them rubrics and exemplars. You tell them how they should be scoring. You try to standardize their judgments. 

C. Then, you want to remove the outlier scorers and keep the ones who are consistent with each other. You want comparable scores, so you can explain what the scores actually mean, what they refer to. You don’t want that mess of inconsistency, which you believe does not actually help anyone. 

To ensure that test takers are rated fairly and outlier scorers are caught and removed, you have every performance rated by two different scorers, selected at random. If they agree, that’s good. If they disagree, that’s bad. Over the course of a day or a week, each scorer rates a whole bunch of performances, so you can check their overall rate of agreeing with their randomly selected scoring peers. The ones who tend to agree more are the ones you keep, and the ones who seem to march to their own beat more often are not invited back. 


D. You are actually running a business or have some sort of budget. You want the scoring done as quickly as practicable. And the scorers? This job is rather repetitive and boring, so they don’t focus as well on each individual performance when they are rating so many of them. These are two different pressures for speed, both top-down and bottom-up. Some scorers go faster, taking little shortcuts and using their own heuristics in place of the full formal procedures laid out in step B. 

Is that real scoring expertise at work? Is that developing skill? Or is it replacement of the official procedures with something else a bit faster?

E. Let’s look more closely at what happens to the agreement rates. We’ll use easy round numbers to make the analysis simpler. Imagine that half of the scorers take that faster route, and half go by the officially sanctioned formal route. And imagine that the faster route is twice as fast. That means that the fast group is going to score twice as many performances as the formal group. That means that everyone’s randomly selected scoring partner is twice as likely to be a fast scorer as a formal scorer. 

In our little simplified thought experiment, these two groups are using two different methods of scoring, and they sometimes will differ in the scores they report. Whether I am a fast scorer or a formal scorer, 2/3 of the time I will be randomly paired with a fast scorer. I will have better agreement rates if I use the fast scoring method than the formal scoring method. The fast scorers will have higher agreement rates than the formal scorers. 

The fast scorers will be retained and the formal scorers will be replaced, and some of the replacements will opt for the faster method. This will further lower the formal scorers’ agreement rates and further increase the fast scorers’ agreement rates. 
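The arithmetic in this thought experiment can be checked with a small simulation. This is only a sketch of the toy model above; the pool sizes, the 2:1 speed ratio, and the assumption that the fast heuristic departs from the formal score on 30% of performances are illustrative choices, not figures from any real scoring program.

```python
import random

random.seed(0)

N_PERFORMANCES = 20000  # performances, each rated by two random scorers
P_DEPART = 0.3          # chance the fast heuristic departs from the formal score

# Half the pool uses the fast method, half the formal one; fast scorers
# work twice as quickly, so any randomly drawn rating is twice as
# likely to come from a fast scorer (weights 2:1).
types, weights = ["fast", "formal"], [2, 1]

agree = {"fast": 0, "formal": 0}
total = {"fast": 0, "formal": 0}

for _ in range(N_PERFORMANCES):
    a, b = random.choices(types, weights=weights, k=2)
    # Scorers using the same method always agree in this toy model;
    # mixed pairs disagree when the fast method departs from the formal score.
    depart = random.random() < P_DEPART
    score = {"formal": 0, "fast": 1 if depart else 0}
    agreed = score[a] == score[b]
    for t in (a, b):
        total[t] += 1
        agree[t] += agreed

rate = {t: agree[t] / total[t] for t in types}
print(rate)
```

Under these assumptions the fast scorers land near a 0.9 agreement rate (2/3 × 1 + 1/3 × 0.7) and the formal scorers near 0.8 (2/3 × 0.7 + 1/3 × 1), purely because each scorer's random partner is more likely to share the fast method. Neither number says anything about which method is valid.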

*********************

What has happened to your construct? What are the test takers’ performances being judged against? Would you even notice the shift?

Would you notice that the use of reliability to evaluate scorers drove a shift from the formal and documented scoring procedures that were designed to best evaluate the construct to some, perhaps obvious, shortcuts that do not consider all of the dimensions and subtlety of the construct?