What is the Purpose of a University?

Last month, I declared that it is our job to teach. If law schools expect their students to tolerate (or even protect the rights of) those making low quality and/or offensive arguments – perhaps even made in bad faith – then they should teach their students how to do so. It is not enough simply to demand that they do so. They should explain why this matters and how to put up with such garbage.

Again, demanding proficiency with important skills is not what educators do. Instead, we teach. Or, at least, that’s what I think.

But perhaps I am wrong. Perhaps I misunderstand law schools. Perhaps I misunderstand universities.

And so, I am asking myself: what is the purpose of a university? What should the priority of a university be?

  • To teach and educate students?

  • To protect free speech?

  • To create knowledge?

It matters how an institution might prioritize these different ideas because occasionally they may come into conflict.

Or, perhaps I am painting with too broad a brush? Perhaps we can ask:

  • What is the purpose and/or priorities of undergraduate institutions, among educating students, protecting free speech and creating knowledge?

  • What is the purpose and/or priorities of professional schools (e.g., medical, law, education schools), among educating students, protecting free speech and creating knowledge?

  • What is the purpose and/or priorities of graduate schools of arts & science (i.e., the folks who do PhD’s in traditional disciplines like history, philosophy, biology, psychology), among educating students, protecting free speech and creating knowledge?

Perhaps, being a K-12 educator, I mistake the purpose of these…what do we call them? “Institutions of higher learning?” Isn’t that the traditional name?

I am trying to approach this issue with intellectual integrity and intellectual humility. I highly value free speech — that’s a big reason why I have been a member of the ACLU for decades. I am trying to question my conclusions and re-examine my reasoning.

But I am having trouble seeing how free speech is more important than educating students. I totally see how these institutions should educate students about free speech. I totally see how free speech generally supports the various missions and priorities of such institutions. But these institutions should primarily focus on the dissemination of information, primarily through educating students.

Am I missing something?

ChatGPT/LLMs’ Improvement Ceiling

Ten years ago, Siri was a quite impressive artificial intelligence/personal assistant. It was amazing. But it is no longer impressive or amazing. Why not?

Well, the approach they took — the basic architecture and technology behind Siri (and the others in its cohort) — turns out to kinda be a dead end. It intrinsically has some issues that cannot be overcome, and those issues make improvement elsewhere difficult, if not impossible. That approach could only take us so far — in spite of the fact that our phones and their servers are astronomically faster than they were 10 years ago. This is not a hardware issue. It’s just in the basic architecture. And this idea that some technological approach is limited and will hit a ceiling is not unique to that kind of AI. It is true of every technology, be it hardware or conceptual.

So, now we have the hardware to do this LLM (large language model) approach, best known in the form of ChatGPT. A different approach.

So much of the hype around these LLMs is really centered on their improvement and the idea that they will get much better quite quickly. Don’t think about the limitations today, we are told. Instead, we are supposed to imagine improvement that overcomes those limitations.

Why should we imagine that? Are they suggesting there really are not any real limitations?

I don’t buy that for a second. I really don’t. No one should. Of course there are limitations! Of course there are diminishing returns! Of course this LLM approach has intrinsic strengths and intrinsic weaknesses. LLMs are not the route to an all-powerful supreme being.

So, what do we know about all LLM instances? There is the problem they call “hallucinations.” They do not understand or care about what is true. They make up shit. I’m sure that there are other issues, too. But this is the one that really gets to me.

Is this “hallucination” problem intrinsic to LLMs? Is this a problem that can be overcome? At this point, the burden is on the hypesters to explain clearly why it is not.

And let me ask you: would you ever trust an AI personal assistant who could not be trusted to give you true answers? Who might make up directions or make up books? Who cannot be trusted to be honest and accurate? Who does not even care about being accurate? What useful role could such an AI play in your life? 

LLMs are a neat trick. I am sure they are useful for a wide range of things. I might even find some uses — perhaps mostly professional, though not entirely — but I do not see that this approach is going to get us to where so many people so fervently would like it to go.

Yeah, there exists an improvement ceiling, and I think we already know what it is.

It Is Our Job to Teach

Recently, there has been a brouhaha at Stanford Law School about how students protested and disrupted the presentation of a speaker invited by the school’s chapter of The Federalist Society. In short, the invited speaker was provocative, students were provoked, some protested quietly in a non-disruptive fashion and others appeared to want to disrupt his presentation. It was disrupted. He was rude. They were rude — or at least disrespectful. 

The question at hand is whether universities — and particularly law schools — should respect norms of free speech that allow controversial — or even odious — speech, or whether there is some speech that is so disrespectful/harmful that it falls outside those protections.

(No, free speech does not include trying to drown out or disrupt others. Almost all free speech questions can be addressed with the answer of “more speech,” not “less speech.”)

The school’s Dean, Jenny S. Martinez, published a response to the contretemps, siding with the idea that the answer is more speech, disrupting the presentation was wrong, and that this is a particularly important value at universities. She explained that lawyers need to be able to listen to and respond intelligently to the arguments of others. Blah blah blah.

Now, I agree with her. I agree with almost everything she wrote. I agree with the blah blah blah parts, too.

And yet, I think she left something very very important out. I think that she is right. I think that I am right. But there is another perspective — not mine, but one that I want to understand — that she should have addressed. She barely waved a hand at it, and really didn’t even do that.

Imagine this perspective, though it is not mine:

Imagine that you are not part of a traditional elite, not part of a group from which the powerful elite have traditionally come. Imagine that you have lived your life surrounded by popular depictions and assumptions that people who (perhaps superficially?) resemble you are lazy, dumb, criminal, alcoholic, uneducated and/or otherwise marginal, disempowered, and objectionable. Imagine that you feel deeply defensive that the culture and the elite do not view you as a truly worthy equal, that perhaps you should be tolerated but that you will never actually belong.

I don’t have those experiences. That is not my story. But it is the story of many students. 

I have no doubt that deep education requires intellectual — and perhaps emotional — risk-taking, requires openness and requires trust. Learning to think differently is difficult and even challenges one’s identity. It puts one’s old values at risk, requires one to look at the world through different lenses and thereby forces a different relationship to oneself and everything that one has ever known. That is a lot. 

So, that is a lot to ask of people who have a lifetime of experiences telling them that the culture around them does not trust them, does not think well of folks like them. What ought an educational institution do to make success at this endeavor of deep learning more likely? What kinds of support should it supply? When should it introduce challenges? In what manner — and with what timing — should it start to remove those supports?

I have no doubt that educational institutions have a responsibility to educate, and not merely to demand proficiency. No doubt at all. That’s basically a tautological statement, that educational institutions should educate their students to instill in them their most important lessons.

And I agree with Dean Martinez that it is really important that lawyers be able to hear and respond thoughtfully to arguments that they believe are without merit and/or even made in bad faith. Listening to that garbage and not losing your shit? That, that’s a lawyer skill. It’s incredibly important. 

It appears to me that Stanford Law School’s answer is to throw their students into the deep end. Perhaps to announce that being able to swim is really important, and then throw their students into the deep end — all without any real effort to educate them on this important skill.

And without any consideration at all of how keeping students who were never taught to swim constantly nervous that they might be thrown into the deep pool affects their necessary trust in their professors and the institution.

Stanford Law should teach this. All law schools should teach this, if it is such an important lawyer skill. Merely demanding proficiency is not enough. A one-time lecture on this is not enough, be it in writing or some other form. A one-time workshop is not enough.

And Dean Martinez should be a lot more mindful of the difference between the behaviors and dispositions we expect of experts and the supports that educators need to provide to help our students to become experts. I agree with her goal. I agree with her views on freedom of speech on campus. But I think she sorely misunderstands the needs of her students and her responsibilities towards them.

The Most Dangerous Idea in Large-Scale Assessment?

Perhaps the most dangerous idea in large scale assessment is the idea that items assess standards, as opposed to assessing test takers. So much sloppiness and so many inappropriate inferences (and uses) of tests — which means they are not valid tests! — are caused by that enormous mistake.

If items directly assessed standards (or KSAs), then it would be possible to examine an item to see whether it is aligned to the standard without considering test takers. Item developers could just think about the ideal of an item and the ideal of an item that is aligned to a particular standard. They would not need to know, understand or think deeply about students.

But items actually assess students’ proficiencies. They are about test takers’ cognition — which is why they are called cognitive assessments. And test takers vary. They vary a lot. They vary in proficiency, in background, in experiences and in how they were taught. They vary in their command of other proficiencies. One might be a good reader and another a poor reader, making a word problem a very different challenge for two students with similar arithmetic ability.

We say that a valid item elicits evidence of the targeted cognition for the range of typical test takers. We take the idea that test takers vary very seriously. There is a range of typical test takers for an item — a multi-dimensional range.

Different test takers might find a different entry point into an item. They might have a different first thought. They might have a different initial guess. They might have a different next cognitive step, after that initial guess. They might consciously apply different strategies, or be differently aware of how they are getting to their answers. Because they have had different teachers who used different explanations or different examples, two test takers can view the exact same item with different views of its novelty — and all that that implies about finding a path to a successful response.

Test takers vary.

But little in item development training or item development practices dives into how test takers vary. There is little — and usually no — documentation about the different ways in which a standard is taught or the different kinds of common mistakes and/or misunderstandings that potential test takers have with the standard.

Instead, we too often rely upon one adult’s view of what the most likely reaction and cognitive path towards a solution might be — too often done without thinking and without the appropriate humility that there are many other potential reactions and paths.

In spite of all of this, people expect individualized score reports and make individual inferences about test taker capabilities based on a test that assumes that test takers all react and think the same — an assumption that is logically at odds with the range of standards and the idea that different kids will get different items wrong.

This idea that items assess standards is particularly ill-suited to be paired with the expectation that these tests can deliver useful information about individual test takers. Even without that expectation, the frequent mismatch between real live test takers and the assumed Aristotelian ideal of a test taker means that even the aggregate results rarely reflect reality well across the tested population.


The limits of parental authority in America


Of late, I’ve been confronted far more often by people claiming that parents should have vast amounts of authority over their children and their children’s education — in ways that not only disturb me, but that I think are actually un-American.

In this country, we use democracy and elections to make community decisions about values — often disputed values — and how those values should be enacted and enforced within our communities. Yeah, democracy is a really poor way to do that. Recall that Churchill said, “Democracy is the worst form of government — except for all the others.”

So, we do not use the New Testament to make these decisions. We do not go to oracles or prophets. There is no supreme leader or wise person who decides these things for us. We trust the wisdom of the crowd, we use elections and our Constitution — the only founding document for our government — says quite clearly that authority to do this comes from The People, “We, the people.”

(I love the Declaration of Independence. I read it in full most years. It’s got one of my favorite lines ever, “He has plundered our seas, ravaged our Coasts, burnt our Towns, and destroyed the Lives of our people.” Man, that’s some good rhetoric! But that’s what it is, rhetoric. The Declaration might be a founding document for our country, but it is not a founding document for our government. There is only one founding document for our government, and that’s the US Constitution — which makes clear that the government’s moral authority stems from The People.)

So, America trusts the people, collectively, to make decisions about values. At the same time, however, one of the truly distinguishing characteristics of this country is how we value the individual and individualism. While “liberty” and “freedom” mean many many things, one thing that they mean in this country is that individuals should be free to engage in their own decision-making about their own private lives and property. We prize stand-out excellence and self-efficacy in this country.

There is a tension there. No doubt. Where do we draw the line between the zone of private individual freedom and the zone of the public that is controlled by the People (and their elected representatives)?

Well, that line has moved through the years, and not just in one direction. In some ways — in many ways — the private zone has expanded. But in some ways, it has shrunk. Well, not so much shrunk as been better delineated by the recognition that other people exist with equal moral and legal standing.

For example, at the nation’s founding, women had few rights. Very few rights. Women could not necessarily own property. When my parents were born, a married woman could not get her own credit card. Oh, wait…let me correct that…when I was born, a married woman could not get a credit card in her own name. Married women did not have bodily autonomy at all, really — their husbands had virtually unconstrained control over their wives’ bodies. But we now recognize that women are people, too. Women — including married women — have all the same legal rights as men. So, the zone of private control by men has shrunk, but the zone of private control by women has expanded.

I’m not going to go into slavery and Dred Scott, but you know…

This takes us to children. They clearly should not have full control over all private decision making; they are just children. We used to set the age of majority at 21, and only reduced it to 18 for the Baby Boomers (and succeeding generations). Children have many many rights, but no real control over their lives.

That poses the question: Who should determine what is best for a child?

Now, very few parents are experts in child psychology, developmental psychology, nutrition, medicine, moral instruction, curriculum, etc. Even those who are, well…lawyers make the worst clients, doctors the worst patients and virtually every shrink’s kid is all kinds of messed up. When our emotions and sense of identity get more involved, we are often unlikely to make the best decisions. So, why should we trust that parents will always make the best decisions for kids?

Sure, we hope that most of the time they will make the best decisions for kids. They are usually more invested in those children than anyone else is. Most parents want what is best for the kids, at least most of the time, or at least they think they do. But all parents are fallible people. Some of them are not great people. Some of them are sometimes not great people. As a matter of convenience, we have to trust them most of the time.

But should we trust them all of the time? Why should we do that?

Are children more like some piece of private property over which owners have huge amounts of control? Are they more like beloved family members, like wives? Should we trust some adult (or two) to make all decisions about a child — like husbands used to be able to do for their wives? Or should the community be the ultimate arbiter? Should we have a safety net for children, because we — as a community — value our children’s well-being that much?

Back in the day, people had to use animal abuse laws to protect children because the law did not conceive of the idea that children would have rights of protection against mistreatment by parents, just like the law did not protect women from their husbands. 

Honestly, I do not trust that parents should be the final and ultimate arbiters of what is good for their children — just as I do not trust husbands to be the final arbiters of what is good for their wives.

Rather, as an American, I trust We, The People. 

Recognizing What We Optimize For

Life is about tradeoffs. Work is about tradeoffs. Work-life balance is about tradeoffs.

Nothing is perfect in every possible way. Instead, tradeoffs grounded in external pressure, priorities and/or values aim towards some sort of acceptable balance of different factors and criteria. These tradeoffs really are about values.

There are many different values that we might bring to large scale assessment development, including

  • trust of educators.

  • information for parents.

  • feedback for students.

  • information for policy makers.

  • reliability.

  • validity.

  • testing/seat time.

  • cost of development.

  • operational costs.

  • scoring costs.

And no doubt many many more. 

Some of these potential values have clearly been given lower priority, and some higher priority. 

Now, The Standards for Educational and Psychological Testing say (in the second sentence of the very first chapter) that validity is “the most fundamental consideration in developing tests and evaluating tests.” I quote this line all the time. Validity is the alpha and omega of assessment — or it should be.

I am concerned — I have been concerned for a long long time — that validity is not given the priority that our bible says it should be. I am concerned that we optimize for reliability, at the expense of validity. I am concerned that we optimize for time, at the expense of validity. I am concerned that we optimize for cost, at the expense of validity.

Reliance on templated items and various forms of automation can deliver less expensive tests. But they will be worth less because of the increasing costs to validity. If the goal of these new efforts and technologies is to invest those cost savings and time savings back into validity, perhaps they can be worth it. But if the vision is merely to save time and money, they are just another drain on validity, when our tests are far from having validity to spare.

What Cognitive Complexity Is Not

As we often find, understanding what something is can be aided by thinking about what it is not.

  • First, cognitive complexity is not difficulty. Memorization of the names of all the US presidents is not cognitively complex, though getting that right is difficult — and notably more difficult than it was when I had to do it in sixth grade. Low cognitive complexity does not assure low difficulty. While higher cognitive complexity tasks often are more difficult, practice can lower difficulty. Furthermore, different people vary in how easy or difficult they find individual tasks. Two people can differ in which of two tasks they find easier, regardless of the tasks’ relative cognitive complexity.

  • Second, cognitive complexity is not grade level. The more advanced knowledge, skills and understandings found in later grades are not necessarily higher in cognitive complexity than those found in lower grades. In fact, some of the most advanced forms of knowledge contain large amounts of quite specialized memorized knowledge (e.g., drug dosing information). Most experts have quite a bit of essentially memorized knowledge that is particular to their field — in addition to advanced understandings of complex ideas, which they apply with greater cognitive complexity and which connect and explain the significance and importance of that lower cognitive complexity stuff.

  • Third, cognitive complexity is not importance. Low cognitive complexity knowledge can be important or unimportant. Memorizing each of the facts below is low cognitive complexity, but some of them are more important than others.

Content Area   | Low Importance                                         | High Importance
ELA            | The name of Romeo’s first obsession in Romeo & Juliet  | The correct usages of their, there and they're
Math           | √3 = 1.733                                             | 7 x 8 = 56
Science        | The atomic mass of iron                                | Carbon dioxide acts as a greenhouse gas
Social Studies | Identity of James Polk’s vice-president                | Exact wording of the 2nd Amendment to the US Constitution

Similarly, high cognitive complexity skills can also vary in importance. Now, the square root of 3 (to four significant figures) is 1.732 — and not 1.733, as listed above — and that fact really does matter in some contexts. But memorizing it simply is not important.

  • Fourth, cognitive complexity is not the number of steps or amount of time a task takes. Memorizing the names of our 46 presidents might take a while to accomplish and certainly has 46 steps. That does not make it particularly cognitively complex. More of the same does not make the task more cognitively complex, even if it makes the task more exhausting or difficult. 

  • Fifth, cognitive complexity is not the context or scope of a task. Low cognitive complexity tasks that are done as part of a larger project are not higher complexity for being part of that larger task. For example, proofreading a 10 page paper is no more cognitively complex than proofreading a 2 page paper. The larger paper might be more complex and producing it might have been a more cognitively complex task, but the cognitive complexity of proofreading is not changed by the complexity of the larger task — though difficulty, time and number of steps well could have changed.

In fact, tasks of greater cognitive complexity often include or rely on subtasks of lower cognitive complexity. For example, the simple word recognition of sight reading is part of most school work — regardless of the overall cognitive complexity of that school work. But the cognitive complexity of the larger task does not make the subtasks more cognitively complex, and the cognitive complexity of the larger task is not determined simply by adding up the cognitive complexity of all the subtasks.

Is ELA One Construct or Two?

In our recent work on cognitive complexity, we came to the question of whether the ELA construct is really two constructs or one unified construct. Norman Webb’s Depth of Knowledge (wDOK) breaks out the reading construct from the writing construct, and some tests report writing scores separately from reading scores. On the other hand, other content areas report just a single score.

We were — and still are — unsure how to proceed.

The reading process often feels different and distinct from the writing process. The 3 R’s list them separately. But we know that no one can learn to write without reading — and cannot write well without doing a lot of reading.

On the other hand, if you look at the first Anchor Standards for both the Common Core State Standards for reading and for writing, you see that CCSS links reading and writing from the beginning.

CCSS-L’s first Anchor Standard for Reading:

Read closely to determine what the text says explicitly and to make logical inferences from it; cite specific textual evidence when writing or speaking to support conclusions drawn from the text. [emphasis added]

CCSS-L’s first Anchor Standard for Writing:

Write arguments to support claims in an analysis of substantive topics or texts, using valid reasoning and relevant and sufficient evidence. [emphasis added]

Reading and writing are incredibly intertwined — especially in CCSS with its emphasis on writing about text. Beyond the route to skill acquisition I mentioned above, reading and writing remain incredibly intertwined even through the most advanced application of the various reading and writing skills — and are even usually particularly intertwined in the most advanced applications of these skills.

In fact, we believe that assessment of reading skills — certainly at the middle and upper grades — is best done through writing. Perhaps the most basic sorts of reading comprehension (e.g., literal or surface meaning of a text that is disconnected from broader contexts) can be well assessed without authentic writing. However, even at the middle grades, real display of reading skills — especially the most important grade appropriate reading skills — occurs when test takers’ understanding of the text is wielded in their writing.

While some of CCSS’s writing standards do not require writing about text, generally Common Core’s writing standards presume that most academic writing would be about text — which means writing about reading.

Reading is best assessed through writing and writing is quite often supposed to be about reading.

Now, we still think that these two strongly linked constructs are, nonetheless, two (not entirely distinct) constructs. They are linked through reasoning in many tasks, but there are some cognitive processes that are particular to each of them. However, eliciting high quality evidence of middle and upper grade reading skills seems quite unlikely without writing tasks. And Common Core’s strong preference for writing about reading makes eliciting high quality evidence about writing quite unlikely without reading tasks.

Which, of course, leaves us quite troubled.

What is Cognitive Complexity (in Large Scale Standardized Assessment)?


Though most people do not know this, standardized test developers generally examine items for “cognitive complexity.” This is one of many ways that they (are supposed to) ensure the quality of items on tests. Cognitive complexity is not the same thing as difficulty, however. For example, consider the question, “With whom was Romeo obsessed before he met Juliet at the party?” This is a difficult question, but it is not a cognitively complex one. Rather, it is just a memorized fact that you know or do not. Cognitive complexity is something different than item difficulty.

Many of us consider cognitive complexity to be a type of alignment. That is, items are supposed to measure specific skills found in the standards purportedly being assessed, and are examined for that. This additional layer of examination considers whether the cognitive complexity of each item is appropriate for the particular standard the item is intended to measure. Another way to think of cognitive complexity reviews is that the goal is to ensure that the range of cognitive complexity of assessments matches the range found in the standards, even if the match is not taken down to the individual item-standard pairings.

What everyone agrees on is that large scale standardized assessments should not be limited to items of low cognitive complexity. In my view, that is one version of dumbing down tests, and obviously it should be avoided.

So, what is cognitive complexity? Well, on this point there is not a huge amount of thoughtful agreement. But generally, higher order thinking skills and problem solving skills are thought to be examples of greater cognitive complexity and…umm….well, things like memorization are thought to be lower cognitive complexity. But that’s not really a definition, is it?

The problem is that there are different ways to recognize or categorize cognitive complexity, and they each highlight particular aspects of this poorly defined idea. 

For example, some people look to Bloom’s Taxonomy (or Revised Bloom’s Taxonomy, RBT). They suggest that assessments should elicit cognition across a range of RBT categories. Now, Bloom’s (RBT or original recipe) is not really much of a hierarchy, so it is not well suited to the idea of greater cognitive complexity. However, it can be useful to highlight the breadth of different kinds of cognition that a whole test might elicit. RBT acknowledges that the different categories within Bloom’s each have a range of levels, but does not offer a way to compare them across categories. Nonetheless, because of how commonly Bloom’s is used in teacher training (i.e., both pre-service and in-service), it has the advantage of feeling familiar to many educators. So, if you are comfortable with Bloom’s, it is one view of cognitive complexity.

The most common typology used by developers of large scale assessments is Depth of Knowledge (DOK), a system developed by Norman Webb over 20 years ago — that’s far more recently than Bloom’s. Because DOK is so common, our own efforts to clarify the meaning of cognitive complexity have focused on it. Our Revised DOK (rDOK) is an attempt to preserve as much of Webb’s original DOK (wDOK) as possible, while addressing some of its intrinsic shortcomings. Generally, both versions of DOK focus on the difference between the kinds of skills that are applied more automatically and the kinds of skills that require more careful thought and deliberation when applied. Our efforts with rDOK are primarily focused on how poorly wDOK has been used in practice.

Examination of cognitive complexity should hold test developers’ feet to the fire. It should force them to struggle with the constraints of standardized tests as they try to include more cognitively complex cognition. It should drive them to be more innovative, as it highlights past shortcomings. It should help to make the case that items need to be better, available item types need to be richer and assessment of what standards describe requires real resources to score (and report).

Cognitive complexity should not be so undermined that it becomes just a hoop to jump through resentfully. Norman Webb wanted his DOK to highlight differences between the kind of rich and thoughtful work in which students engage when doing their authentic school work and the simpler thinking to which large scale assessments are so often limited. We think that he was right.

Do Items Have a Central Cognitive Complexity?

Cognitive complexity can be a powerful lens through which to examine items. It can highlight gaps between the richness and depth of instruction, of standards, of classroom learning and what appears on assessments. Unfortunately, the idea of cognitive complexity is too often just waved at, perhaps resented, with only cursory effort made to consider it. More thoughtful use and consideration of cognitive complexity can contribute to the development of assessments that educators feel better reflect their efforts and the learning goals for students. In other words, tests that they feel have greater face validity.

Perhaps the fundamental mistake in thinking about cognitive complexity in the assessment context — very much like thinking about other aspects of alignment — is assuming (or insisting) that items have some fixed or inherent complexity and should be evaluated as such. RTD’s central tenet is that valid items elicit evidence of the targeted cognition for the range of typical test takers. RTD is deeply rooted in the idea that assessment is about cognition and that cognition varies across the range of typical test takers. RTD is clear that to understand cognitive complexity, one must start with the cognitive paths of a variety of test takers.

So, why is the idea of some innate item complexity wrong? Well, it depends on what is meant by that claim.

 

If item complexity refers to the complexity of the result or product, there are many reasons.

  1. The actual final work product on most items on most large scale standardized tests is merely the selection of one offered option among a small handful. A is not a complex product. C is not a complex product.

  2. If one argues that it’s not the label on the answer option but rather the contents of the answer option, one has hardly made any progress. Those answers are almost always quite short and straightforward. They certainly lack the range of complexity and depth of answers that might be offered in a classroom — be it in writing or orally. This leaves very little range of complexity for standardized test items.

  3. Furthermore, this approach would suggest that any math item whose answer is a number is not very cognitively complex and that math problems are generally less cognitively complex than even fairly simple ELA items. We do not believe any of that.

  4. Some argue that the final product is evidence of the complexity of cognition of the test taker. Frankly, whether they realize it or not, this simply concedes that cognitive complexity is a trait of the cognitive path (see below) and capitulates on the claim that complexity is in the final product.

Clearly, this idea simply is not productive or useful in the context of large scale standardized assessment. This means that item complexity must somehow be about the process through which test takers arrive at their final answers. And, again, we ask whether this idea is compatible with the idea that items have some important fixed innate complexity.

Put another way, is there some singular decisive process or path through an item to a solution/response that should be the focus of cognitive complexity classification decisions? Is that idea productive as a general approach?

One simply must acknowledge that all items can be responded to with multiple cognitive paths. All. Test takers can respond with nonsense and/or can just guess, rather than working through the problem. Classroom teachers know well that many students respond to stress by losing confidence that they can work through a problem and revert to guessing. One might posit that the singular decisive path is the one that yields the correct response, but guesses can be correct.

Clearly, there exist multiple potential cognitive paths, so the question is really: Which cognitive path is the singular decisive process or path through an item to a solution/response that should be the focus of cognitive complexity classification decisions? If, of course, such a thing even exists.

  1. It seems obvious to us that if you had to choose one cognitive path as the most important one, it would be the one that most test takers use — or at least that a plurality of test takers use. But that is an empirical question that could only be settled through a massive amount of quite difficult data collection (ideally through the development of mind reading technology that could be used at large scale, to best assure that the sample whose minds are read is appropriately representative of the testing population). But even were that research possible, this would not simply be a feature of the item. Different testing populations may choose different cognitive paths. Moreover, if standards, curriculum or textbooks change in a state or district, students may be influenced to select different paths. There is no singular decisive path here.

  2. Perhaps the shortest and most direct path to the answer is the singular decisive path to examine for cognitive complexity. Well, one would clearly have to put aside the actually shortest paths. Guessing renders everything low cognitive complexity. Having already seen the problem and simply remembering the answer is not guessing or cheating, and is also a very short cognitive path. But that is clearly not the singular decisive path. Very many math problems on standardized tests can be answered through backsolving, because of the modality of selected response items. Is that backsolving path the singular decisive path of these items for these purposes? If proponents of this approach would accept that and would call for serious efforts to find the shortest path to a correct response (i.e., still excluding guessing and already knowing the answer), we could almost respect that. The problem is that that still leaves questions of how much experience and prior knowledge to disregard. Some test takers might not have seen this exact question, but have seen the exact same type of question and therefore can more quickly cut to the answer. Which of these paths does one exclude from consideration and which are candidates for the singular decisive path? And once again, issues of variation across test takers and their classroom — and other — experiences must be considered. There is no decisive answer here.

  3. Perhaps the singular decisive path is the one that is built around the KSAs of the desired aligned standard? That certainly would be convenient. Unfortunately, this idea quickly falls apart. In order for this approach to have any merit, one would have to assume that items are correctly aligned, and that simply begs the question (new school) of what alignment means. It certainly forestalls the reality that items often have multiple paths to a response and begs the question (old school) of which path one should base such determinations upon. Otherwise, it is tantamount to saying, “Well, if they do the item the way we want them to — which might not be the easiest, most obvious or most appealing path for a test taker to take — then this is how complex the item is.” In other words, cognitive complexity is a product of the path that content development professionals (CDPs) would like test takers to take. However, just as different test takers might see different paths as preferable, so might different CDPs. This simply becomes an arbitrary decision.

  4. The previous option is just one way to get to a very common problematic view: the singular decisive path is the one that I would take, that I imagine that I would have taken and/or that I imagine that most (or typical) test takers would take. All of those are projections of the CDP’s own thinking, habits and/or preferences. Because different CDPs can come to different answers here, there certainly is no singular decisive path to base determination of cognitive complexity upon.

Once you acknowledge that there are multiple cognitive paths through an item, there simply is no way to identify one of them as the singular and decisive path. If items are to have some fixed innate complexity that should be the starting point and focus of cognitive complexity recognition, it is not found in the cognitive paths of test takers.

There is one more notable approach to identifying cognitive complexity – one that is actually quite commonly used. This approach says that skills and standards themselves can be classified by cognitive complexity. Unfortunately, this approach also collapses under the weight of reality and simple practical considerations.

  1. The simplest application of this approach suggests that more advanced skills are of greater cognitive complexity. But this simply turns cognitive complexity into a recapitulation of grade level. That cannot be right, as that redundancy renders it useless.

  2. A second application might consider the sophistication of the application of the skill or standard, but this generally becomes a standard-specific recapitulation of grade level. Alternatively, it might be about the proficiency with which the skill has been (or must be) applied. But — again — proficiency is supposed to be a different construct. (In fact, IRT puts item difficulty on the same scale as test taker proficiency/latent ability; see the sketch after this list. Cognitive complexity needs to be something else to be useful.)

  3. A third application of this approach might consider the difficulty of the skill or standard, but difficulty is a product of instruction, practice and preparation. Different teachers can emphasize different skills or standards, and different curricula can set out better instructional paths towards some standards than others. Again, this is not simply a function of the Aristotelian ideal of the skill or standard. That is, difficulty is a population-specific result. Furthermore, we already collect empirical measures of item difficulty, and this approach is largely duplicative of that — or at least is largely duplicative for well aligned items.
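As an aside on that IRT point above: in the simplest IRT model, the Rasch model, test taker ability and item difficulty literally sit on the same logit scale. A minimal sketch, with illustrative numbers only:

```python
import math

# Rasch model, the simplest IRT model. Ability (theta) and item difficulty
# (b) live on the same logit scale, which is why "difficulty" is already
# spoken for and cognitive complexity must be something else to be useful.
def p_correct(theta: float, b: float) -> float:
    """Probability that a test taker of ability theta answers an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(p_correct(theta=0.0, b=0.0))  # 0.5: ability exactly matches difficulty
print(p_correct(theta=1.0, b=0.0))  # ~0.73: ability one logit above the item
print(p_correct(theta=0.0, b=1.0))  # ~0.27: item one logit harder than the test taker
```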

Where does this too-long discussion leave us?

  • There is no innate item complexity to be found in the final product.

  • Test takers always have multiple paths to a response, and most every item has multiple paths to a correct response.

  • There is no particular singular decisive path that one can use to determine item complexity.

  • Classification of individual standards or skills by cognitive complexity either falls apart for redundancy with other measures or is dependent on the population (and its educational experiences).

  • Every route out of this conundrum of how to determine a fixed or innate item complexity resorts to projection and preferences of the CDP (or other human judge of its complexity).

Which means that such a thing does not exist. Which means cognitive complexity must be grounded in something else. RTD says that items often have a range of cognitive complexity because they prompt a range of typical test takers to take a range of cognitive paths to their responses.

So, what does RTD offer for cognitive complexity determination? Well, that’s rDOK.

ChatGPT Results May Be Plausible, But They Are Not Credible

ChatGPT — the new artificial intelligence chatbot that is all the rage — is amazing. It writes plausible text that seems fairly informed about the world and any number of topics. It is so impressive that there are all kinds of people saying that this sort of approach could replace search (e.g., Google) when looking for information.

Wow is that a bad idea. An incredibly bad idea. Just dumb. Really really really dumb.

First, and least importantly, I find it shocking how impressed so many people are by text written at the level of a very talented 8th grader. That is, the writing of a really smart 14 year old.

Second, and far more importantly, it appears that people do not understand the role of plausibility and accuracy in what ChatGPT is doing, nor their implications for the kinds of things that they might search for.

ChatGPT does not care about accuracy at all. That’s not part of its programming. That is not how it was designed and not how its designers want it to be evaluated. If you care about accuracy, you need to steer clear of such things.

Instead, ChatGPT cares much more about plausibility. Obviously, it’s just a program, so it does not actually care. And I’m not sure that its designers would use that term, “plausibility.” But I am fairly certain that they would concede that that is part of the goal. ChatGPT is built upon some vast and broad corpus of text and generates responses that fit the patterns that the machine learning AI has found in that vast and broad corpus. That is, it generates text that rather looks like what it has already found, building on what the user types in on their side of the chat.
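To make that concrete, here is a toy sketch of the idea. This is emphatically not ChatGPT’s actual architecture (real LLMs use transformer networks trained on enormous corpora, not bigram counts), but it illustrates the core principle: the next word is chosen because it is statistically plausible given the preceding text, and truth is never consulted.

```python
import random
from collections import defaultdict

# Toy illustration only. Real LLMs use transformer networks over enormous
# corpora, not bigram counts, but the principle is the same: pick the next
# token because it is plausible given prior text, never because the
# resulting statement is true.
corpus = (
    "the test measures reading skill . "
    "the test measures writing skill . "
    "the book explains test development . "
).split()

# Count which words follow which.
following = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev].append(nxt)

def generate(start: str, length: int = 8) -> str:
    """Emit a plausible-looking word sequence; accuracy is never checked."""
    words = [start]
    for _ in range(length):
        options = following.get(words[-1])
        if not options:
            break
        # random.choice over a list with repeats samples by observed frequency
        words.append(random.choice(options))
    return " ".join(words)

print(generate("the"))
# e.g., "the test measures reading skill . the book explains"
# Fluent-ish and statistically grounded, but nothing checked it for truth.
```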

So, I gave it another sort of test run. I asked it about my work, about the differences between validity and reliability and how they apply in the development of items for standardized tests. Not surprisingly, it started with a bunch of very generic ideas about reliability and validity. It focused more on reliability. Through the chat it started talking about classroom assessment, and I tried to redirect it again.

Eventually, it recommended some books — or were they articles? Well, either way, they were focused on classroom assessment. So, I asked about standardized test development. Most of what it recommended was more about psychometrics. But one of the recommendations seemed on point! It was a book (or article?) that I’d never heard of.

"Constructing Effective Test Items" by Susan M. Brookhart

Fascinating. Let me see if I can find that. I’d love to know more about it.

  • Google found nothing. (Well, it found two pdfs that do not mention such a piece, though all those words do appear. There’s a reference to another piece by SM Brookhart. Well, co-authored by SM Brookhart.)

Google reports no “great matches.”

  • Google Scholar found nothing. (Well, it found the same thing. It cites it as by TP Hall, when that is actually just the location (i.e., Tate Page Hall 334) of the meeting for which the PDF is a copy of the agenda.)

Google Scholar reports one result, but it is just a meeting agenda.

  • Google books did not find it, either.

No results found for "Constructing Effective Test Items" by Susan M. Brookhart

Google Books reports no results

  • Bing offered more results, but none to the article. Rather, they are to Dr. Brookhart’s website, ASCD page, ResearchGate page, etc. There’s also a link to something by a Tamanna Kalim over at Bangladesh Education Article. Now, it is not the worst article in the world, but it is not by Dr. Brookhart and is relatively generic.

This means that ChatGPT just made up this book (or article) that it claims exists. Just made it up. Sure, it is plausible that she would write such a thing. But it does not exist. I am not saying that it is a bad book (or article). I am not saying that it is irrelevant or poorly named. I am saying that it does not exist.

Even when ChatGPT provides references, it cannot be trusted. It just makes them up — because that is what it is designed to do: present something plausible, something like what is out there — without any regard to accuracy.

Do not trust ChatGPT. It is just not credible.

Addendum

Five days later, I have tried again. I had a slightly different conversation — one a little more focused on challenges of alignment and false negative inferences (i.e., Type II errors). ChatGPT was quite confident.

Yes, if you have two test questions that include the same two skills or concepts, you can designate one of the questions as being focused on the first skill and the other as being focused on the second skill. This can be an effective way to assess both skills within the same test, as long as the questions are well-constructed and aligned with the assessment objectives. However, it is important to ensure that the two questions are not too similar, and that they are measuring the skills or concepts in different ways. For example, if both questions focus on the same mathematical concept but one uses a more advanced problem-solving approach than the other, it may be more appropriate to consider the first question as the primary focus and the second question as a secondary focus. This can help ensure that the test provides a balanced and comprehensive assessment of the test taker's knowledge and abilities.

We had gone back and forth a bit on this point, and so finally I asked, “Where can I learn more about this?” ChatGPT’s first bulleted response was:

Reading books or articles on the subject, such as "Test Development: A Beginner's Guide" by Ruth E. Newton and David E. Wright, or "Standardized Testing: A Practical Approach" by David M. Nichols and Richard C. Sweetland.

These two books do not exist.

2nd Addendum

This week’s Slate Money begins with a segment on ChatGPT, and Emily Peck quickly makes reference to this issue that I had shared with her.

The Ultimate MisNAEPery: Confirmation Bias

This week’s NAEP results have been deeply deeply disturbing. They should leave all of us with incredibly deep questions about education research and the education research and policy community. We have witnessed a new form of misNAEPery that should cast deep doubt on things that we have long taken for granted as true.

MisNAEPery is the misuse of NAEP data — results from the National Assessment of Educational Progress, known as “The Nation’s Report Card.” Please know that NAEP is a very different set of standardized tests. Students do not specifically prep for it and it has no stakes attached to it. Results are not published for individual students, teachers or schools. In fact, that is not even possible, because it uses something called “matrix sampling.” This means that different students have different questions on their forms, and then all the data is combined into aggregate scores for entire states. ENTIRE STATES! (There is also reporting on 27 particular districts which have volunteered to take part in TUDA (i.e., the Trial Urban District Assessment), but because those districts are smaller than their states, they have larger margins of error.)
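For the curious, here is a much-simplified sketch of the matrix sampling idea. NAEP’s real design uses balanced incomplete block spiraling and plausible-values methodology, which this toy version does not attempt; it only shows how short, varied forms can still cover a large domain in the aggregate.

```python
import random
import statistics

# Much-simplified matrix sampling sketch. NAEP's real design uses balanced
# incomplete block (BIB) spiraling and plausible-values estimation; this toy
# shows only the core idea: no student sees the whole item pool, yet the
# state-level aggregate covers all of it.
ITEM_POOL = [f"item_{i:02d}" for i in range(60)]          # full content domain
BLOCKS = [ITEM_POOL[i:i + 15] for i in range(0, 60, 15)]  # four short blocks

def administer(n_students: int) -> dict:
    """Give each sampled student one short block; pool responses by item."""
    results = {item: [] for item in ITEM_POOL}
    for _ in range(n_students):
        block = random.choice(BLOCKS)   # each student's form is short...
        for item in block:
            # ...and these responses are fake: a coin weighted at 60% correct.
            results[item].append(random.random() < 0.6)
    return results

responses = administer(n_students=2000)
state_average = statistics.mean(
    statistics.mean(r) for r in responses.values() if r
)
print(f"State-level average proportion correct: {state_average:.3f}")
# Meaningful for the state as a whole; meaningless for any one student,
# who answered only a quarter of the pool.
```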

This approach allows NAEP to address countless objections to most standardized tests. Freedom from having to compile results for individual schools, teachers or districts allows NAEP to check for and account for issues that other assessments cannot even dream of. Short tests that nonetheless address large content domains, care around item interaction effects and and and…it’s the gold standard.

The Question of the Decade

The educational policy and practice question of this young decade is about the impact of the pandemic on students, learning and teaching. The most obvious and contentious aspect of this question is the contribution of school building closures – and consequent reliance on remote (i.e., Zoom) schooling – on “learning loss” (i.e., the unfortunate name given to the idea that students did not learn as much during the pandemic as they would have otherwise, that they did not progress as much as they would have if there had not been a pandemic).

It is odd that this is such a contentious issue, because most everyone has a stake in believing that remote schooling is inferior to in-person schooling. Those who wish to attack teachers unions, educational bureaucracy and even teachers blame them for school building closures and the resulting learning loss. (Of course, they conveniently ignore the fact that schools that are more responsive to market pressures — such as private and charter schools — also closed their buildings during the pandemic.) Those who think that the New 3 R’s (i.e., rigor, relationships and relevance) are vital to success with the old 3 R’s (reading, ’riting and ’rithmetic), that teaching is more than just lecturing and is instead about meeting students where they are and meeting their needs…well, we think that time with teachers is valuable. We think that teachers matter. We do want to think schooling can help students beyond their own cognitive developmental path and the impact of various out-of-school factors.

We should all want to see that lost time in school with teachers had a cost. Even if people disagree about whether it was necessary or worth it to pay that cost, virtually everyone expects the new NAEP results to give us a sense of what the cost was.

This is because states differed enormously in how long school buildings were closed. Chalkbeat’s coverage of the new NAEP data shows this, such as Texas schools being open 88.7% of the time and California’s schools being open just 6.9% of the time. (Go read that coverage. It’s surprisingly good. And note that while Matt Barnum wrote the story, the graphics — like the one I have copied below — are by Thomas Wilburn. The originals are interactive.)

The Unexpected

The problem is that NAEP does not show that states whose schools were open to more in-person learning had markedly stronger results. It just doesn’t. For example, California’s results slipped back less in 8th grade reading and math than Texas’s, and exactly the same amount in 4th grade reading and math. New York (14.2% in-person instruction) slipped less than Texas in 8th grade and more than Texas in 4th grade. Florida (96.8% in-person) was also worse than Texas in 8th grade and only better than Texas in 4th grade reading. Again, note that this is not about absolute levels of performance, but rather about how much each state’s students slid back during the pandemic, from one cohort to another. Even just among these four largest states, we do not see the expected results.

Taking all the states’ results into account, we do not see the expected patterns. In some cases, we see far weaker versions. In some cases, we do not see anything like what we expected. What literally everyone expected. (And I mean literally literally. 100%. Absolutely everyone. Not a single person predicted what this data shows. Not one.)

Again, go read Chalkbeat’s coverage. And if you want more, there’s EdWeek’s coverage.

The Deeper Problem

No one attacks NAEP as being bad data. It is the gold standard. Those of us who decry the low quality of many state assessments and decry bad analysis of quantitative data point to NAEP’s quality. Those with more confidence in quantitative assessment results look to NAEP as the benchmark.

But suddenly, in light of these shocking results, people are making excuses. Because the 2022 NAEP results do not show what everyone expected, people are…behaving differently.

The real value in data and research is not in finding support for what you already believe. The real value is in helping you figure out what is true. For those with intellectual integrity, it is more important to learn than it is to convince others. It is more important to be right tomorrow than to appear right all along.

NAEP is telling us that we were wrong. That all of us were. Now, from a Bayesian perspective, the strength of our prior belief should make us less open to countervailing evidence. It should. That is OK. But the strength of NAEP as the highest quality evidence should make us question any prior belief. That is what NAEP is for. That is how everyone who knows about NAEP views it.
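A back-of-the-envelope version of that Bayesian point, with made-up numbers:

```python
# Back-of-the-envelope Bayes, with made-up numbers. Prior: we were ~90% sure
# that building closures caused large, state-visible learning loss.
prior = 0.90
prior_odds = prior / (1 - prior)   # 9:1 in favor of our belief

# Likelihood ratio: how much more probable is the observed evidence ("no
# clear state-level pattern in NAEP") if our belief is false than if it is
# true? For gold-standard data, this should be large; assume 10x here.
likelihood_ratio = 10.0

posterior_odds = prior_odds / likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)
print(f"posterior belief: {posterior:.2f}")   # ~0.47, no longer confident
# A strong prior resists weak evidence, but strong evidence moves even a
# strong prior. That is what NAEP is for.
```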

So, I have to ask: if NAEP does not shed valuable light on this question, what is it ever useful for? If this is not the absolute best case use of NAEP data, then what is? And if NAEP is not useful, how is any achievement data ever useful, or any on-demand evaluation of student knowledge, skills and/or abilities — be it standardized or not?

Or, if NAEP remains credible, what does that imply about the value and nature of teaching and the classroom? What does this say about natural cognitive development, as opposed to intentional learning? What does this say about the potential for additional use of remote schooling and how we might reshape childcare structures in this country?

Integrity in the Future

What is not acceptable is to simply ignore this year’s NAEP results.

I need to re-evaluate my confidence in NAEP more broadly. That’s my next step. I am comfortable saying that I would rather find problems with NAEP than have to devalue teachers and the new 3 R’s. I’ve not really dug into the mechanics and methodology of NAEP in a long time. And I’ve never subjected NAEP items to RTD’s level of item examination. At the same time, I also need to rethink the potential of…oh my god it hurts to type this…cyber charter schools. Oh the pain! The pain! But I was wrong about something: either NAEP or the nature of teaching and the importance of teachers.

As I look around this week, I do not see this kind of soul searching. I do not see acknowledgements of the importance of this moment for education researchers, educational policy practitioners, in-school educators and assessment experts.

That worries me.

Vertical Scales and Unexamined Assumptions about Unidimensionality

Just this week, Chalkbeat’s Matt Barnum asked about the meaning of NAEP’s apparent use of a single scale to report all of its test results. This topic — vertical scaling — has real problems, and NAEP’s example makes them easy to see.

What is a Vertical Scale?

While the same set of marks is reused across grade levels (e.g., either the A-F system or the 100-point scale), this is not always done with reporting on standardized tests. Though people understand that a student who just earned a B+ in 10th grade knows much more than a student who just earned an A- in 5th grade, some people want to highlight that there is a longer continuum across the grades. They even want to compare the performance of students (or collections of students) across grades. That is where vertical scaling comes in.

With vertical scaling, we do not have to reset our understanding of the reporting scale for each grade. Instead, the scale just keeps going up. So, the average 2nd grader might score in the 140’s, an average 3rd grader in the 160’s, an average 4th grader in the 190’s, and so on and so on all the way up to the average 11th grader in the 620’s. It’s a VERY long scale, with lots of overlap between grades.

There are generally defensible techniques for doing this — though they rely on problematic assumptions. Vertical scales are very important to support various policy goals and evaluation approaches. More simply, though, they support more kinds of comparisons — even comparisons of how much a single child learned one year vs. another year or how much two children in two different grades learned.

The key to vertical scaling is the use of anchor items. Anchor items allow the linking of two tests — across multiple forms of a test, across different years, across different grades. By reusing a handful of items on each test, they can act as a kind of splice that enables comparisons across tests. So, if they quantify the performances of test takers on those anchor items on each test, they can use them as a common baseline to link performances across all the items on each test to each other — regardless of which test the items are on.

In the context of vertical linking, they take some of the harder items on the lower test and some of the easier items on the higher test and make sure they appear on both tests. (They do not have to be the easier/harder items, but I think the logic works better when they are.) These shared anchor items provide the psychometric bridge to create a single reporting scale for both tests. Do that across the gap between each pair of adjacent grades and you can get a single scale for the entire span of K-12 education.
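Here is a minimal sketch of how that splice can work, assuming the simplest case: Rasch-style mean/mean linking, where the slope is fixed at 1 and only a shift is estimated. All the item difficulty numbers are invented for illustration.

```python
import numpy as np

# Difficulty estimates for the SAME anchor items, calibrated separately
# on the grade-3 form and the grade-4 form (each on its own scale).
# All numbers are invented for illustration.
anchor_diff_g3 = np.array([0.8, 1.1, 1.4, 1.7])     # hard for 3rd graders
anchor_diff_g4 = np.array([-0.9, -0.6, -0.2, 0.1])  # same items, easy for 4th

# Mean/mean linking: the constant that re-expresses grade-4 estimates
# on the grade-3 scale is the difference in mean anchor difficulties.
shift = anchor_diff_g3.mean() - anchor_diff_g4.mean()

def to_g3_scale(theta_g4: float) -> float:
    """Re-express a grade-4 ability estimate on the grade-3 scale."""
    return theta_g4 + shift

print(f"linking shift: {shift:.2f}")  # 1.65 with these made-up numbers
print(f"average 4th grader (0.0) -> {to_g3_scale(0.0):.2f} on the 3rd grade scale")
```

Real vertical scales use fancier linking methods (e.g., Stocking-Lord) and many more anchor items, but the underlying move is the same: the anchors are the only bridge between the two scales.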

Unfortunately, I don’t buy it.

Unidimensionality’s Basic Falsehood

Unidimensionality is the idea that whatever it is that we are measuring, we really are measuring just one thing. That is, if this is a math test, then we are measuring math. We can basically treat each item as contributing equally to the score because each item measures one unit of math. We can summarize performance with a single score on this 3rd grade math test because 3rd grade math is just this single homogeneous thing.
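Formally — and this is just a sketch using the simplest unidimensional model, the Rasch model, not NAEP’s actual (more elaborate) IRT machinery — the assumption looks like this:

$$P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{e^{\theta_i - b_j}}{1 + e^{\theta_i - b_j}}$$

Here, $\theta_i$ is student $i$’s single underlying “math” ability and $b_j$ is item $j$’s difficulty. Each student gets exactly one $\theta$, and every item is assumed to tap that same one dimension, differing only in difficulty. That single $\theta$ is the unidimensionality assumption.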

The problem is that 3rd grade math is not a single homogeneous thing. 3rd grade math is MANY things. Common Core has 5 different domains in 3rd grade math, comprising 25 different Math Content Standards. If one counts all the individually broken down subparts of CCSS’s 3rd grade math standards, you get 33. Of course, there are also the eight Standards for Mathematical Practice.

How can we report 3rd grade math as a single score when it has all those different parts? We know the parts are different because the content experts tell us that. We know that different kids have trouble with different parts. We know that they come at different grain sizes — even just between the Standards for Mathematical Practice and the Content Standards.

The Reporting Compromise and Its Unexamined Assumption

There is such utility in reporting performance unidimensionally that we simply have to find a compromise. Now, this is a compromise that we have all long been comfortable with. After all, we accept report cards that give students a single grade for math, a single grade for science, and a single grade for each course they take. We accept that in test reporting as well.

The compromise is to acknowledge that there are different standards, so the reported score is a composite score. 4 parts this domain, 3 parts that domain, 6 parts this other domain. It is like a teacher who says that grades in their class are made up of the following (a minimal sketch of the arithmetic follows the list):

  • 30% homework

  • 30% projects

  • 20% tests

  • 20% class participation
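That composite is just a weighted average. The sketch below uses the teacher’s weights from the list above with invented component scores, and then shows how a different (also invented) weighting changes the same student’s result.

```python
# The teacher's grading policy from the list above; component scores invented.
weights = {"homework": 0.30, "projects": 0.30, "tests": 0.20, "participation": 0.20}
scores = {"homework": 92, "projects": 85, "tests": 78, "participation": 95}

composite = sum(weights[part] * scores[part] for part in weights)
print(f"composite grade: {composite:.1f}")  # 87.7

# Same student, different (invented) weights -> different composite.
alt_weights = {"homework": 0.10, "projects": 0.10, "tests": 0.60, "participation": 0.20}
alt_composite = sum(alt_weights[part] * scores[part] for part in alt_weights)
print(f"with test-heavy weights: {alt_composite:.1f}")  # 83.5
```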

Because standardized test reporting impacts so many students — thousands or millions — those composites should be designed very carefully. They should properly weight the different elements of the entire content domain, because different weightings will yield different results. Different weightings will encourage teachers to focus on different parts of the curriculum. Different weightings will favor or disfavor different students, different teachers, different schools and different instructional approaches.

Thank god, the developers and sponsors of standardized tests know that the weightings matter. They try to be thoughtful about them. However, they may not be thoughtful enough. They may be too driven by convenience and too accommodating of the limitations on the tests (and of how those limitations drive the weightings). But no one takes designing a test blueprint lightly. Nonetheless, there is always something arbitrary about the weightings, as there is no definitively correct answer and there are so many factors that influence blueprint design that have nothing to do with the content domain itself (e.g., item type limitations, scoring budgets, seat time limitations, etc.).

Unfortunately, the real unexamined assumption is that the items themselves actually measure what they purport to measure. There is very little work on making sure that items do not individually produce false positive or false negative results. That is, whether students can solve them without using the targeted standard or might fail to solve them for reasons other than lack of proficiency with the targeted standard.

This lack of care with item validity (i.e., items that elicit evidence of the targeted cognition for the range of typical test takers) undermines the thoughtful work of designing the composite that a test’s blueprint promises. If the items don’t measure what they purport to measure, the elements of the composite are not properly weighted. Some elements might not even be represented at all!

This leads to scores whose meanings are uninterpretable — unless we just accept that the blueprint and the details of the composite’s weights do not really matter. After all, 3rd grade math really is just one thing, right?

Problematically Assuming Unidimensionality for Vertical Scaling

Vertical scaling necessarily assumes unidimensionality. It has to. Even if the composite were crafted incredibly wisely and each item actually were perfectly valid, successive grades would have different composites. Some subdomains are more important in 3rd grade math and others more important in 4th grade math. Eventually, lower level content is taken for granted so that higher level content can be focused on. For example, while arithmetic is always important, the importance of integer addition on tests fades as more advanced arithmetic is covered, and eventually the importance of arithmetic fades as algebra and other more advanced topics gain focus.

  • If the composite changes, what does it even mean to link scores between them?

  • If we acknowledge that the summative score is made up of different subdomains, how many anchor items do we need to link the subdomains across grades?

  • If a new subdomain appears at some grade, what does it do to the very idea of linking scores across grades?

The only way to resolve these (and other) issues is to hand wave them away and assume unidimensionality.

Back to NAEP’s (facially) Vertical Scale

The National Assessment of Educational Progress — “the nation’s report card”!! — makes no such claim. It does not claim to be a vertical scale. It does not claim that 4th grade scores can be compared to 8th or 12th grade scores. It does not claim that a two-point increase in 8th grade means the same thing as a two-point increase in 4th grade. It does not claim that high enough performance on the 8th grade test would mean more advanced average proficiency than a very low performance on the 12th grade test.

Not at all. It is not a vertical scale. But the three grades are reported in a way that looks like it might be a vertical scale.

But here is how we know it could never be a vertical scale: You cannot anchor items between two levels so far apart. If the items on the 4th and 8th grade tests each actually represent appropriate grade-level standards, we should not expect that any decent number of 8th graders would get the 4th grade items incorrect. Nor should we expect sufficient 4th graders to get any 8th grade items correct. Certainly not enough to splice the two tests’ scales together.

This is not about how smart the 4th graders are. Rather, it is that they simply have not been exposed to the 8th grade content yet. Any signal (i.e., information about 8th grade math skills) in that data would be overwhelmed by noise (e.g., test taking savvy). Similarly, 8th graders who get 4th grade items incorrect might be far more likely to do so because they misread the item, rushed or were sloppy than because they lack the content expertise. Again, the noise of construct-irrelevant factors would overwhelm any signal of some 8th graders’ lack of proficiency with 4th grade content.

You simply cannot link tests that are so far apart because you cannot ask these students the same kinds of questions.
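A quick simulation makes the signal-to-noise point visible. Under the (invented) assumption that 4th graders mostly guess on 8th grade items, an out-of-level item carries almost no information about their ability — and an item that carries no information cannot serve as an anchor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
ability = rng.normal(0.0, 1.0, n)  # 4th graders' true 4th-grade math ability

# On a grade-appropriate item, correctness tracks ability (logistic model).
p_on_level = 1 / (1 + np.exp(-ability))
on_level = (rng.random(n) < p_on_level).astype(float)

# On an 8th grade item, assume ~20% guessing with only a sliver of
# ability effect -- the "noise overwhelms signal" assumption.
p_off_level = 0.20 + 0.02 * (ability > 1)
off_level = (rng.random(n) < p_off_level).astype(float)

print(f"corr(ability, on-level item):  {np.corrcoef(ability, on_level)[0, 1]:.2f}")
print(f"corr(ability, off-level item): {np.corrcoef(ability, off_level)[0, 1]:.2f}")
# The off-level correlation sits near zero: the item tells us almost
# nothing about these students, so it cannot splice the scales together.
```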

The Point?

Well, I see two important takeaways.

First, I find Matt’s question disturbing because he works for a very good education-specific news site and his beat includes both education policy and education research. Among scholars I respect, he is well thought of. No question, he knows a lot for an education journalist.

And yet, even Matt did not understand this. I’ve no idea how many times he has reported on NAEP scores, and the use of testing has been one of the dominant themes in education policy for decades. If Matt does not understand this, then what does that say about the rest of the media? What does this say about our elected leaders, about parents and about voters?

Second, whenever I challenge psychometricians about their assumptions of unidimensionality, they retort that their methods are robust to some amount of multi-dimensionality. They report that their statistical methods do not break down when faced with data that is not strictly unidimensional. Of course, I accept that. But that does not mean that the results they report mean at all what they think they do. Validity is about “interpretations of test scores for the proposed uses of tests” (The Standards for Educational and Psychological Testing, 2014, p. 11). Even if the statistics yield a result, the use and acceptance of vertical scales — even if only the suggestion of a vertical scale with NAEP — shows how little consideration psychometrics gives to validity.

I suppose that there’s a third takeaway, though it is less far-reaching. Matt’s question about NAEP scores has long since been addressed. In 2012, David Thissen wrote about the question of NAEP and vertical scales: “The conclusion of this essay will be that evidence can and should be assembled to support, and make more precise, interpretations of the first kind (‘one year’s growth’), while interpretations of the second kind (cross-group comparisons across four-year spans) should be discouraged.” This work was done under contract with the publishers of NAEP, and yet they have taken up neither of his suggestions. They should do better.

Excellence is Multi-Dimensional

My high school experience back in the 1980’s was a bit odd, in quite a few ways. For one, it was an almost brand new school when I got there. It was a new public exam/magnet school, and for various reasons the district decided to just let in one class at a time. So, the first year, there were just freshmen. The second year, that first class rose to be sophomores and my class joined. It wasn’t until its fourth year that we had seniors, and that first class was the top class for their entire high school careers.

I was on a competitive team from my freshman year, and there were two real stars in the class above me, but they took very different paths with very different strengths. One was rock steady, always doing what he could do, without mistakes. The other was more mercurial, with more brilliant moments mixed in with too frequent mistakes.

Now, both of them were excellent. But one was steady at a high level, and the other had more variation from meet to meet. Sometimes James exceeded Peter, but sometimes James fell short.

Throughout our high school years, Peter raised his level. He remained consistent, not making mistakes. But he did that at a higher level of performance each year. Through those years, he nearly caught up to James’s peaks. Similarly, James also improved. But for James, improvement had to mean addressing those mistakes. Through those years, he nearly caught up to Peter’s consistency.

Back then, I thought that I was more like James. I wanted to be more like James. I wanted to reach those heights, and I did not yet realize that James and Peter were converging. I saw them embodying two contrasting archetypes. And I certainly did not appreciate the value of consistency or of simply not making mistakes.

It was not until late in college that I really started to appreciate that James was not better than Peter. I had not understood the value of reliability — particularly when that reliability comes with a high level of performance. Yes, I still see value in moments of peak brilliance, but I value consistency far more than I used to.

Consistently avoiding mistakes — even ones you are individually capable of avoiding — requires a kind of focus that I did not have as a teenager. While I have gotten better, it is still sometimes hard for me. Whatever the reasons, it does not come easily to me in any domain.

As an adult, I see incredible value in avoiding downsides, potholes and mistakes. I value reliable contributions from colleagues, reliable friends and reliable recipes. The staples of our lives, of our work, of our pantries are so under-appreciated. They deliver every day, and being able to count on them makes everything else so very much easier.

This was true on my high school math team. The most thoughtful football analysts say it is true of running backs, too. It is an under-appreciated kind of excellence.

Who Makes Decisions about Goals and Resources?

Recently, someone tweeted to me, “I have lots of faith in teachers to implement learning properly. I have less faith in schools and admins to set the proper goals and resource appropriately.”

We are in an era of decreasing trust in teachers and schools. Of course, we are in an era of increasing distrust of all institutions, so this shouldn’t be so shocking. And while trust in teachers remains quite high, it has declined a little bit in recent years. Teachers now trail only nurses and medical doctors, but they used to rank higher. (They are still far ahead of police officers, judges and bankers. Local office holders are, on net, a little distrusted, and members of Congress very much so.)

Nonetheless, it is quite striking that someone would distrust schools and administrators to “set the proper goals and resource appropriately.” These simply are not the jobs of teachers or school administrators.

Educational goals are laid out in state learning standards. These state standards are developed by educational professionals, researchers and policy-makers, and then customized for various states. Finally, these customized standards are ratified and endorsed by state legislatures. For example, Florida customized the Common Core State Standards and the Next Generation Science Standards and calls the results the Sunshine State Standards.

Educational goals are not set by individual teachers, individual schools, districts or their administrators. Educational goals are set by state legislatures.

Educational resources are similarly out of the hands of schools and educational administrators. States are the primary determiners of educational resources — again, through acts of state legislatures. Local municipalities also contribute to educational resources through local government budgets. Again, it is local elected officials who make these decisions. In some areas, the school district has the authority to levy taxes, instead of the general local government — but that is done through elected school boards. In none of these cases are schools or administrators responsible for these decisions. In all of these cases, it is elected officials.

Of course, the federal government contributes ~10% of school resources. Here, it is Congress that decides. Again, elected officials.

To be fair to all of those legislative bodies, their acts usually have to be signed by an executive. Thus, it is not the legislatures alone who set standards or resource levels. But they are all elected officials.

Now, where I live, we actually vote on the town budget every year. My local town government does not have the power to set budgets. Rather, its elected officials put together a budget for the citizens of the town to vote on. Occasionally, a town budget somewhere does not pass, and the town government must put forth a new proposal for citizens to vote on. This American Life recently did a piece on a contentious effort by citizens to radically alter a school budget. But nowhere in any of this do schools or school administrators set budgets.

It is incredible that people distrust teachers and administrators to do things that they’ve not been responsible for in generations.