Fisking the Haladyna Rules: Introduction

In 1989 and in 2002, Haladyna and his colleagues published compilations of item writing rules from the literature. These two lists, which I call The Haladyna Rules, have been incredibly dominant ever since. Virtually every serious source on test development refers to them as the item writing guidance. In 1989, Haladyna & Downing called their list, “a complete and authoritative set of guidelines for writing multiple-choice items” (p. 37) and then they updated and refined it further with Rodriguez in 2002.

The problem is that the list is just not very good. It focuses on presentation and perhaps test takers’ confusion, but really gives rather little guidance about how to make sure that the item actually tests what it is supposed to test. Now, we call this idea that a test (or item) is supposed to test what it is supposed to test, “validity.” There are actually much more complicated ideas and principles for validity, but at the fundamental level, the question is whether the test is doing what it is supposed to do. And the Haladyna rules that so dominate the “Item-Writing Guidelines/Rules/Suggestions/Advice” (1989, p. 40) simply steer clear of this idea, for the most part. (Among those more technical concerns and views about validity is that it is actually how the test is used that matters. But whether a test even can be used appropriately depends deeply on whether it is actually testing what it is supposed to test.)

But that is not the worst thing about them. I will save the worst thing about them for the end of this series.

You see, there are 31 rules in the 2002 version. And there are 31 days in October. (Hello, lady!) So, each day of the month of October, I will analyze one of Haladyna, Downing and Rodriguez’s 31 rules. In order.

Yes, I am saying that rule 31 is the worst thing about the list.

Buckle up.

Giving Writing Advice

Both as a high school English teacher and as a dissertation coach, I have a lot of experience giving people advice on their writing. As a student, a collaborator and researcher (and occasional blogger) I have a decent amount of experience receiving advice on my writing. I have observed that there are two basic approaches. Both aim to help the writer produce a better piece, but they come from very different directions and have quite different goals.

(Side note: Unless someone specifically asks for it, proofreading is really not nearly as valuable as so many people seem to think. Proofreading a piece that is not yet otherwise in its final form can be a little bit useful, but the fact is that many words and sentences and even paragraphs are going to change, creating numerous opportunities for little mistakes to be (re-)introduced later, anyway. This kind of feedback is not advice, as it doesn’t actually help the writer to know whether the piece works and/or give them any ideas for how to improve it.)

The approach that I aim for tries to build on what is there in the piece, to help the author do a better job of meeting their goals, of communicating what they want to say. This approach requires reading the piece from the author’s perspective and asking questions about it to give advice that is centered on them and their hopes for the piece. Sure, this sort of advice might result in wholesale reworking of the piece, but only to make it more clear what they are trying to get at, and to improve the piece’s ability to convince others or to convey their idea to others. This approach is centered on them and their thinking, helping them to write the best version of their piece that they can be most proud of. This approach is all about supporting them. I learned to do this from my own best teachers—because they were focused on helping me grow as a thinker, rather than just tearing apart where I was at the time.

The alternative approach is centered on the concerns and interests of the advisor, instead of the writer. The advisor thinks about what they would most like to read, or other ideas that the draft reminds them of. They offer suggestions for making the piece more like something that they would write, perhaps stylistically or even in the ideas and/or perspective in the piece. This kind of advice is actually less about improving the piece than it is about making it feel more intuitive to the advisor.

Of course, there are a couple of circumstances in which the second approach is actually quite appropriate. If the advisor is actually a gatekeeper who gets to decide what is published on their platform, they may be concerned about the voice or content that their platform consciously tries to cultivate. They may make exceptions sometimes, but they know that they will not do anything like that in this case. Such advisors are in a position to put their own preferences ahead of the writer’s own concerns, as the writer likely seeks the piece to be included there. In such cases, the writer can decide whether it is worth it to them to give in on what they wanted to say or on their voice in order to be included on that platform. Though that advisor is trying to exercise a lot of power, it remains up to the writer to accept that deal, or not.

The other circumstance is much much more rare. There are occasional advisors who are really good at having a feel for the target audience of readers—as opposed to just a sense of their own personal preferences. Such advisors would not actually be pushing their own voice and/or ideas, but rather channeling what that desired audience might respond to best. That might call for some serious shifting of the piece, even if it shifts somewhat from the author’s original intentions.

This dynamic is not limited to advice on a piece of writing. I first noticed this dynamic when I was a teacher and paid close attention to the advice that principals and assistant principals give teachers. I cannot tell you how many times I had directed at me or heard directed at others advice along the lines of, “Well, when I was teaching, I would…” What I never heard from them was, “This never worked for me, but I think something that would work for you is…” On the other hand, I did hear senior teachers say, “You should watch how [other teacher] handles that. I think that they do something that would work for you.”

As a teacher, I want to help my students to better express what they want to say. If it is not working or I do not think that it can work, I might engage in a conversation to find something related from their own perspective and values that they might want to say instead. But I always always try to find the kernel of what is really important to them to help them express that in a way that is true to themselves. Obviously, as a dissertation coach, the goal is a form of writing that does not feel natural or authentic to anyone, but that only makes it more important that the ideas and perspective are deeply grounded in their own experiences and views.

There is something deeply frustrating about advice to writers (or others) that marginalizes their views and goals. Sure, it might yield a better piece, though that is questionable. It is simply more likely to yield a piece that is more agreeable to the advisor, but that is not the same thing.

My Most Conservative Policy Opinion

I believe in democracy.

It is not that I believe that democracy always comes to the right answer. It is not that when democracy comes to an answer different than my own that I am convinced that I am wrong. But I was taught that politics is how groups of people make decisions about values, and in this country our political system is a democratic one. And you know Winston Churchill’s famous description, “Democracy is the worst form of Government except for all the other[s].”

What is the alternative? Divine right of kings? Oligarchy? I believe that the more people that vote, the more legitimate the decisions are. Surely, high quality education matters, because an electorate of well-informed voters will yield better decisions than one filled with ignorant and thoughtless ones. This is what so many state constitutions say; this is why we need strong universal public schools.

Of course I would rather everything be done according to my values and my wishes and my judgment. But if we are vesting all the power in a single person, it’s not often going to be me. So, I prefer the wisdom of the crowd to any other mechanism. (With some sorts of constitutional limits, of course.)

And yet, for my whole career, I have seen educators and others deny the validity of democracy when they don’t agree with The People. They want to override the will of The People and the will of their communities. They think they know better and should be able to apply their own moral judgment in place of our democratic system of government and laws.

Now, I understand the challenge of a situation in which one’s own values are quite at odds with the values embedded in the law. I understand the outrage of seeing marginalized and disempowered people (and groups of people) further harmed by the actions of governments—especially when those people are children. Especially when those people are children. I feel that deeply.

But what is the alternative? I do not want an impassioned minority to impose its mistaken will upon me. How can I tell the difference between those situations and my own desire to impose my own minority will on others? How do we determine a system of government on any level if every impassioned minority thinks that they have the right to impose their will upon others?

We have seen better paths. We have seen the civil rights movement engage in direct action and civil disobedience. We have seen a Gandhi-led colony throw off its colonial oppressor by exposing the moral excrescence at the center of that enterprise. We have seen attitudes change—and the law change—as more people learned more about same-sex love and LGB people around them. We have seen paths.

I am deeply saddened—and at times infuriated—at what appear to me to be right wing minority views of what schools may do to support and educate children. I have seen irrational fear mongering and unrealistic views of what is actually developmentally appropriate for children at different ages, built upon willful ignorance born of arrogant self-centered intellectual laziness. I have been appalled and morally outraged. These asses exist in my community, as they exist around the country.

Luckily, I live in the part of the country where these asses have not gained control over the levers of government, where they are so obviously in the minority and cannot hide behind other culture war issues. I know that, therefore, it is easier for me to preach democracy. I have no children being targeted by them. I know that, therefore, it is easier for me to preach democracy. It is easier for me to have patience for the democratic mechanisms of discussion and debate to work their way through elections to policies.

But what is the alternative?

Manipulating Item Difficulty

There are these psychometric guidelines on desired item difficulty levels. Items should not be so hard that too few test takers respond successfully and not so easy that too few fail to respond successfully. (Yes, this view is based in empirical observations about item difficulty—that is, the share of test takers who respond successfully.)
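As a rough illustration of that empirical definition, here is a minimal Python sketch. The response matrix is made up, the function names are hypothetical, and the cutoffs of 0.2 and 0.9 are placeholders I chose for the example rather than values taken from any particular guideline.

```python
from typing import List


def item_difficulty(responses: List[List[int]]) -> List[float]:
    """Share of test takers who answered each item correctly.

    `responses` has one row per test taker and one 0/1 entry per item.
    """
    if not responses:
        raise ValueError("need at least one test taker")
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_takers for i in range(n_items)]


def flag_items(p_values: List[float], low: float = 0.2, high: float = 0.9) -> List[str]:
    """Label each item the way a difficulty guideline might.

    The `low` and `high` cutoffs are hypothetical placeholders.
    """
    labels = []
    for p in p_values:
        if p < low:
            labels.append("too hard")
        elif p > high:
            labels.append("too easy")
        else:
            labels.append("ok")
    return labels


# Made-up data: four test takers (rows), three items (columns).
responses = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]
p_values = item_difficulty(responses)
labels = flag_items(p_values)
print(list(zip(p_values, labels)))
```

In this toy data the first item gets flagged only because every test taker answered it successfully, which says nothing about whether it elicits evidence of what it is supposed to test.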

That’s just garbage. That is, if a test is supposed to be criterion-based, that is total garbage.

I guess it’s a question of what tests are for and what they are supposed to measure. If the goal is to sort or rank test takers, there simply is not a lot of information in such items. Sure, there is some, but not as much.

But certification/licensure exams and most educational testing are criterion based, not norm based. If the goal is to measure what test takers know (and don’t know) and can do (and can’t do), then it should not matter how empirically difficult an item is. If it is a criterion referenced test, then those items offer us just as much information as more moderate difficulty items.

Of course, this comes down to unidimensionality. If the goal of the test is some unidimensional result, highly difficult or very easy items might provide less information. But if the goal is to report on the many standards (or other elements in the domain model) in the construct, those easy and difficult items might provide large amounts of information about their targeted cognition.

Yeah, the idea that items should not be too easy or too difficult comes from, again, assumptions of unidimensionality.
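To see where that assumption bites, here is a minimal sketch using the Rasch model, which is one common unidimensional scoring model (my choice for illustration; the argument here does not depend on it specifically). Under that model, the statistical information an item provides about a test taker is P(1 - P), where P is the probability of a successful response, so it peaks when P is 0.5 and shrinks as the item becomes very easy or very hard for that person. The function names are hypothetical.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a successful response under the Rasch model.

    theta: the test taker's position on the single assumed scale.
    b:     the item's difficulty on that same scale.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta: float, b: float) -> float:
    """Rasch item information at theta, which is P * (1 - P)."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)


# For a test taker at theta = 0, compare a very easy item, a matched item
# and a very hard item. The difficulty values are made up for illustration.
for b in (-3.0, 0.0, 3.0):
    print(b, round(rasch_probability(0.0, b), 3), round(item_information(0.0, b), 3))
```

This is information about a person's position on one scale, which is exactly the assumption at issue. It says nothing about how much an item tells us about its own targeted standard.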

Furthermore, there are other huge problems with this sort of thinking about item difficulty. It makes some sense to define item difficulty empirically, but that actually shifts attention from the deeper meaning. Some ideas and lessons are more difficult to teach and learn, and therefore get more attention from teachers and more practice from students. Other ideas and lessons are easier to teach and learn and therefore get less time and attention. That difference in attention to teaching and learning moderates the difference in difficulty when empirical definitions are used. So, what does difficulty actually mean, in terms of cognition?

Of course, it doesn’t stop there. The quality of instruction (as opposed to the quantity discussed above) also impacts learning and subsequent performance. The best teachers might even combine higher quality instruction with higher quantity instruction for some lessons—perhaps because of how difficult the content is and perhaps because of how important they think the content is.

I got just one question wrong on my Chemistry SAT II subject test back in the dark ages. It was about what substance is purple when in a solution in water. I asked my chemistry teacher about this the next Monday and she knew the answer immediately; she just did not think this was important enough to spend time on in class. KMnO4? And 35 years later, I can’t really disagree with her. But it is not like this is a difficult idea or lesson. It’s trivial. It’s low cognitive complexity. It’s just memorized knowledge, and a small amount of memorization at that.

So, if this question never appeared on an important test before, it would likely not be taught in many chemistry classes. But if it appeared on every important chemistry test, it would be taught everywhere. Would that change the difficulty of the concept? As it was taught more, should the item be manipulated to maintain its difficulty? For example, the word purple might be replaced with the word aubergine. Would it be a higher quality item when altered to increase empirical difficulty?

Items should elicit evidence of the targeted cognition for the range of typical test takers—ideally without Type I or Type II errors (i.e., false positives and false negatives). Their difficulty should be determined by the difficulty of the targeted cognition and the quality/quantity of the teaching and learning about the targeted cognition. If the goal is to report on learning, knowledge, abilities and/or skills, there well could be some items that everyone responds to successfully and some that no one responds to successfully. That should not be a problem.

And if the psychometric scoring models say otherwise, they should be replaced.

Barbie, Patriarchy and “Patriarchy”

Forgive me for posting some film commentary.

I really liked the Barbie movie. I think it is really smart, and somewhat subversive in how it makes its feminist arguments. (I will try to avoid spoilers.)

There are two main feminist arguments presented in Barbie. They are presented quite differently, and serve very different functions in the narrative.

The most important narrative driver of the Barbie movie is the idea that there is an essential emptiness in one gender living without power, without access to meaning or purpose, other than to be seen and valued by the other gender. Ken voices that he is in this position very early in the film. This existential state — and its untenability — is there when the audience is first learning about Barbieland and addressed both by and in the film — to almost the very very end. This is part of Betty Friedan’s Feminine Mystique. The Barbie movie explicitly addresses the untenability of this sort of role, but inverts gender roles in order to make the point. Nonetheless, this remains a powerful feminist thrust of the film, one of the most powerful feminist arguments of the 20th century, even though it is presented in the film as the state of men’s positions. It is not overtly presented as a feminist argument or problem, just as Ruth Bader Ginsburg — before she was appointed to the courts — used male plaintiffs to illustrate how our society categorizes and determines sex roles problematically.

The second main feminist argument in the movie comes from the middle aged mother, when she talks about challenges of being a woman amidst contradictory expectations that leave little room — or no room — for actual ordinary variation in experience and behavior from day to day. 

In my view, the feminine mystique argument is about patriarchy, though the film does not make that claim. Of course, the film cannot make that claim, because the Kens do not live in a patriarchal world when they feel the effects of this dynamic. Yes, “patriarchy” is mentioned many times in the film, and something called “patriarchy” is a major element of the narrative. But this feminine mystique issue is not a part of any so-called “patriarchy” in the film, even though in the real world it is entirely a function of pervasive patriarchal expectations for social structures and relationships. 

On the other hand, while I think that the mother’s explanations of problematic dynamics around dichotomous expectations for women are both powerful and importantly feminist, I think that it is a stretch to say that they are just about patriarchy. To claim that they all are rooted in patriarchy is the claim that all (social and interior) problems suffered by women are due to patriarchy. I believe that society’s subtle and overt, rare and pervasive, internalized and social subversion and undermining of women has broader roots than simply patriarchy, and it is too easy — to the point of intellectual dishonesty — to put them all at the feet of patriarchy. And this means that this second major feminist argument is actually not about actual patriarchy — despite how it is positioned in the film.

I walked out of the movie theater trying to figure out how anyone could seriously object to its political contents and messaging, particularly object from the right. I suppose people who feel and say that patriarchy is good — and who use the word “patriarchy” to do so — should be offended by the film. After all, the film’s depiction of “patriarchy” does not address anything about patriarchy that they would defend. Certainly, the vapid Kens (spoiler alert!) do not come to any deep understanding of the nature or power of patriarchy. The only intelligent response from the right that I can imagine is, “If that’s what you mean by ‘patriarchy,’ I can see why you object to it. We object to that, too.” As I have thought more (and hopefully deeper), I have realized that the relationship of the film’s themes and ideas to patriarchy is even more complicated than that.

For example — to spoil a joke, though not anything about plot or character — the idea that love of horses is particularly a masculine or patriarchal thing is obviously false. Girls love horses, too. In our society, horses might even be stereotypically more a girl interest than a boy interest — and they certainly were growing up in my family.

Thus, while the Barbie movie — a fun film full of many funny moments — has a lot to say about feminism, certain ideas of masculinity, self-identity, Mattel, consumer markets, patriarchy, the challenges of being a woman (or girl) and the nature (and significance) of Barbie, it quite often masks — or even mislabels — what it is saying. Sometimes, it is explicit and on the nose. And sometimes, it is more subversively subtle than that. It twists these two main feminist arguments into an interesting plot and a — for me — politically satisfying resolution.

Barbie covers a lot of ground, and it should not be taken at face value or trusted to be presenting everything as obviously as it presents some things.


Authenticity in Large Scale Standardized Assessment

Shocking though it may seem, content development professionals (CDPs) value authenticity in test items. Unfortunately, there are many constraints on test development that prevent CDPs and other test developers from prioritizing authenticity as they might otherwise wish. For example, the simple fact of limited “seat time” (i.e., how long a test can be) can be an obstacle to including many aspects of authentic work on such assessments.

But authenticity actually plays out quite differently in different content areas. In all cases, authenticity can increase test taker engagement. In some content areas, authenticity can help tests to get at the actual standards that both instruction and assessment are supposed to target.

A major problem for ELA test development is the use of commissioned work for reading passages. This can solve many problems, but it often creates a real authenticity problem. Passages written specifically for assessment lack an authentic purpose. For example, letters to the editor are not to real editors or from real concerned readers. Little commissioned stories — or supposed excerpts from longer works — do not generally have the same quality as those that are published by commercial publishers.

Because so much of ELA instruction — and even more of ELA assessment — is aimed at students’ skills at reading and understanding the kinds of texts that they may encounter in their lives, authenticity in ELA assessment often comes down to text selection. Authentic texts are simply critical to authenticity with ELA. Items that ask questions that readers might naturally have about those texts also contribute to authenticity.

Authenticity in science assessment is quite different. The kinds of writing about science that naturally exist in the world are rarely at an appropriate grade level for large scale assessment, and may include many issues and topics that extend beyond any one grade level’s standards. Authentic writing about science is rarely appropriate for inclusion in large-scale standardized assessment.

Nonetheless, authenticity is prized in science assessment. It just takes a different form than in ELA. In science, authenticity comes in the nature of the phenomenon at the focus of the item or scenario set. When it feels like it comes from test takers’ own world and lives, the phenomenon contributes authenticity. When items ask the kinds of questions that a real person — or perhaps even a scientifically minded person — might have about the phenomenon, they contribute authenticity.

This is not to say that relevant topics do not contribute to authenticity in ELA assessments. Certainly they do. But even less relevant topics can feel authentic when an author breathes the life into them that we see in great writing. But in science assessment, it is the topic — the phenomenon — that primarily drives authenticity, rather than its presentation.

Art, Science or Professional Practice?

We hate the formulation that some things are more art than science, and every one of its variants. We think that they miss the nature of art, of science and of the practice being commented upon.

Half of the idea in this formulation is that there are some fields that are predetermined and objective. These are fields with rules — even laws — that render every decision a technical decision with a definitive answer. Those are said to be science.

The other half of the idea in this formulation is that there are some fields that are creative and free, without standards or rules. They are full of personal judgment and preference, instead. Those are said to be art.

We think that this is bullshit.

First, that is most definitely not what science is. Science is a process of inquiry, of trying to understand the natural world, of knowledge creation. This process is creative, involves jumps of intuition, doubt, verification, correction, and iteration. The things that this scientific process has taught us are uncertain and always subject to further refinement — and at times even just replacement. Science never gives certain answers, and the authentic scientific process is full of subjective decision-making.

Second, the effort to liken something to art is often a backhanded compliment — at best. It denies the existence of expertise, the importance of learning and the very idea that the product of the work is appropriate for examination for effectiveness. It denies the possibility of real advancement.

We far prefer recognizing when fields are full of professional judgment. In our view, this means employing the tools from a diverse toolbox towards a particular purpose. Yes, it involves weighing multiple factors without a single clear definitive formula. And yes, it allows for multiple paths to a high quality product. It is not as objective and predetermined as that fantasy of science supposes; instead, it involves the complexity and judgment of actual science.

Art does not have clear standards, or clear goals, and not even a definitive audience. Professional practice has clear goals, clear purpose, recognizable standards of quality — even if they are not definitive — that can be recognized by other professionals. Professional practice should become more effective at achieving its goals over time — both the practice of an individual professional and the capabilities of the larger profession. This is fundamentally unlike art, which does not have clear goals or purpose; art is more free to evolve in unpredictable directions, which do not necessarily constitute advancement as much as simply shifting.

In fact, creativity exists in many professions, including science and engineering. Creativity is not at odds with the scientific process or scientific advancement. Creative problem-solving is a part of engineering and a part of mathematics.

We certainly deny that item development is more art than science or as much art as science. These ideas deny that content development professionals develop professional judgment, that the field has standards or that its products can be evaluated by CDPs for effectiveness. Instead, they insist that everything is really more of an opinion than anything else.

Content development work, like most every profession, relies on professional judgment. Professional practice is the application of that judgment in applying the knowledge and tools of the profession, towards the goals that that profession values.

Professional Credibility of Content Development Professionals

One of the earliest motivations of the Rigorous Test Development Project was the disparity of standing between psychometricians and content development professionals (CDPs). The people who edit and refine the items that appear on tests simply did not have the same seat at the table as the people who did the statistical analysis of test results.

Thus, an original goal of the RTD project has been to raise the standing of CDPs, as they are the ones who work on item validity, a necessary foundation for test validity. (Note: CDPs are not just item writers — who write initial drafts of items. CDP work is much more complex and extended than mere item drafting.)

The most respected professions combine copious technical knowledge with well-honed professional judgment. Aspirants spend years studying their discipline before engaging in some form of apprenticeship, in which they gradually build their own professional judgment as they observe, and/or are supervised by, more experienced practitioners.

Less respected professions are not recognized as having the same quantity and difficulty of technical knowledge, nor requiring the same degree of professional judgment, in order to be practiced at a high level. Thus, respect is often a function of perception and disrespect a function of ignorance.

Professional certification exams exist in many many fields, and yet everyone we have ever met who has taken such an exam complains that it fails to assess what it really takes to perform the work. We believe these exams tend to focus too much on the technical knowledge (often taken out of authentically complex and interconnected contexts). Thus, they fail to address the higher level skills and professional judgment that are the mark of true expertise, or even successful practice.

There are bad lawyers, bad doctors, bad architects, bad beauticians, bad plumbers, and bad project managers. In many of these fields, one cannot even be a bad practitioner without being professionally certified. Licensed professional certification does not ensure quality, and it certainly does not ensure respect for the profession.

In our view, there are two things that truly distinguish the work of psychometricians and of content development professionals that contribute most strongly to these disparate standings.

First, psychometricians study their angle(s) on assessment in school and obtain advanced degrees in their field. They can earn master’s degrees and PhDs in measurement. Content development professionals may or may not have advanced degrees, but they are not degrees in assessment. (They may be in teaching or in their subject matter field.) Moreover, not every CDP holds an advanced degree.

Second, many people have more respect for STEM fields than other fields. They perhaps perceive those fields as harder than other fields, confusing the fact that many STEM problems have obviously and definitively correct and/or incorrect answers with the ideas of rigor and difficulty. They seem not to understand the difficulty of coming to a truly high-quality result when it is not always easily obvious whether the result is correct or incorrect. The need to balance multiple criteria to find a truly good (or even great) answer is not always as respected as the kind of technical work that is built on predetermined routines and literal formulas, if those routines and formulas involve lots of numbers.

So, what can we do about the lack of standing of CDPs?

Professional licensure is no guarantee of respect from those with PhDs or degrees from the most respected professional schools. Certification exams do little to ensure respect or high-quality work. (In fact, regulation is usually about providing a floor for quality, rather than raising the ceiling on performance.) Of course, that presumes that there is someone who could establish and perhaps require such licensure, which is questionable in the first place.

Our efforts in the RTD project are focused on codifying the kinds of knowledge, skills and decision making that CDPs engage with every day, both the assessment-specific examples and the ways that subject matter expertise and knowledge of the perspectives of test takers inform the assessment work. We can envision this body of knowledge becoming a formal course of study as an option in educational measurement, someday. But that form of credentialing is not the goal of our Project. We are still trying to establish the body of knowledge and skills in a form that CDPs and non-CDPs both can see.

When it comes to the standing of CDPs in test development circles, we ask that non-CDP test developers take note of the variety of technical knowledge that CDPs tap into, and their use of professional judgment to balance issues that psychometrics often barely has language for.

Perhaps this way they can engage the kind of professional humility that is so important for learning and for collaborative work, and will better understand what their CDP colleagues can bring to the table.

Undermining Reading Standards


As a former high school English teacher, teaching reading skills is very important to me. This goes beyond mere phonics and decoding. At the high school level, we focus more on the content of the text. This can include implications and layers of meaning. It can include quite dense information. Highly proficient readers take many different issues into account.

The reading standards differentiate reading informational texts from reading literary texts. The basic idea is that reading a novel or short story is quite different than reading a newspaper article or a textbook. I am not sure that the reading standards quite get at the distinctions as I might wish, but there truly are important differences.

As we have been working on understanding the cognitive complexity of reading items on tests, we have faced a truth that makes each of us a bit upset: test prep encourages reading literary texts as though they are informational texts.

Literary texts should generally be read and enjoyed linearly, from beginning to end. They are usually narratives, and while their structure might push a reader to refer back to earlier pages from time to time, they generally move forward. On the other hand, reading informational texts is often much more purposeful. The reader of a literary text should trust the author to take them where the author wishes, while the reader of informational texts might bring much more purpose to the task. They might be looking for particular information.

Therefore, authentic reading of informational texts can be much more strategic and intentional than reading of literary texts. But so many test questions about literary passages turn those passages into informational texts. The reader is looking for the answer to a particular question (about some detail or the meaning of a particular excerpt) rather than taking in and making meaning from the text more organically. This means that students can be encouraged to apply strategies of informational reading to what should be literary reading.

I am not sure how to get around this problem. Certainly, I am no fan of such directed and often simplistic items — even the best versions of those items. Authentic reading and appreciation of literary texts — be they fiction or non-fiction — should not be picayune in its focus or goals.

Certainly, the emphasis on multiple choice items — generally for their speed and scoring economy — is a problem. In fact, they might be incompatible with this sort of reading (and therefore reading standards).


What is the Purpose of a University?

Last month, I declared that it is our job to teach. If law schools expect their students to tolerate (or even protect the rights of) those making low quality and/or offensive arguments – perhaps even made in bad faith – then they should teach their students how to do so. It is not enough simply to demand that they do so. They should explain why this matters and how to put up with such garbage.

Again, demanding proficiency with important skills is not what educators do. Instead, we teach. Or, at least, that’s what I think.

But perhaps I am wrong. Perhaps I misunderstand law schools. Perhaps I misunderstand universities.

And so, I am asking myself, what is the purpose of a university? What should the priority of a university be?

  • To teach and educate students?

  • To protect free speech?

  • To create knowledge?

It matters how an institution might prioritize these different ideas because occasionally they may come into conflict.

Or, perhaps I am painting with too broad a brush? Perhaps we can ask:

  • What is the purpose and/or priorities of undergraduate institutions, among educating students, protecting free speech and creating knowledge?

  • What is the purpose and/or priorities of professional schools (e.g., medical, law, education schools), among educating students, protecting free speech and creating knowledge?

  • What is the purpose and/or priorities of graduate schools of arts & science (i.e., the folks who do PhD’s in traditional disciplines like history, philosophy, biology, psychology), among educating students, protecting free speech and creating knowledge?

Perhaps, being a K-12 educator, I mistake the purpose of these…what do we call them? “Institutions of higher learning?” Isn’t that the traditional name?

I am trying to approach this issue with intellectual integrity and intellectual humility. I highly value free speech — that’s a big reason why I have been a member of the ACLU for decades. I am trying to question my conclusions and re-examine my reasoning.

But I am having trouble seeing how free speech is more important than educating students. I totally see how these institutions should educate students about free speech. I totally see how free speech generally supports the various missions and priorities of such institutions. But these institutions should primarily focus on the dissemination of information, primarily through educating students.

Am I missing something?

Chat GPT/LLM's Improvement Ceiling

Ten years ago, Siri was a quite impressive artificial intelligence/personal assistant. It was amazing. But it is no longer impressive or amazing. Why not?

Well, the approach they took — the basic architecture and technology behind Siri (and the others in its cohort) — turns out to kinda be a dead end. It intrinsically has some issues that cannot be overcome, and those issues make improvement elsewhere difficult when it’s not impossible. That approach could only take us so far — in spite of the fact that our phones and their servers are astronomically faster than they were 10 years ago. This is not a hardware issue. It’s just in the basic architecture. And this idea that some technological approach is limited and will hit a ceiling is not unique to that kind of AI. It is true of every technology, be it hardware or conceptual. 

So, now we have the hardware to do this LLM (large language model) approach, as best known in the form of Chat GPT. A different approach. 

So much of the hype around these LLMs is really centered on their improvement and the idea that they will get much better quite quickly. Don’t think about the limitations today, we are told. Instead, we are supposed to imagine improvement that overcomes those limitations.

Why should we imagine that? Are they suggesting there really are not any real limitations?

I don’t buy that for a second. I really don’t. No one should. Of course there are limitations! Of course there are diminishing returns! Of course this LLM approach has intrinsic strengths and intrinsic weaknesses. LLM is not the route to an all-powerful supreme being.

So, what do we know about all LLM instances? They call them “hallucinations.” They do not understand or care about what is true. They make up shit. I’m sure that there are other issues, too. But this is the one that really gets to me.

Is this "hallucination" problem intrinsic to LLM? Is this a problem that can be overcome? At this point, the burden is on the hypesters to explain clearly why it is not. 

And let me ask you: would you ever trust an AI personal assistant who could not be trusted to give you true answers? Who might make up directions or make up books? Who cannot be trusted to be honest and accurate? Who does not even care about being accurate? What useful role could such an AI play in your life? 

LLMs are a neat trick. I am sure they are useful for a wide range of things. I might even find some uses — perhaps mostly professional, though not entirely — but I do not see that this approach is going to get us to where so many people so fervently would like it to.

Yeah, there exists an improvement ceiling, and I think we already know what it is.

It Is Our Job to Teach

Recently, there has been a brouhaha at Stanford Law School about how students protested and disrupted the presentation of a speaker invited by the school’s chapter of The Federalist Society. In short, the invited speaker was provocative, students were provoked, some protested quietly in a non-disruptive fashion and others appeared to want to disrupt his presentation. It was disrupted. He was rude. They were rude — or at least disrespectful. 

The question at hand is whether universities — and particularly law schools — should respect norms of free speech that allow controversial — or even odious — speech, or whether there is some speech that is so disrespectful/harmful that it falls outside those protections.

(No, free speech does not include trying to drown out or disrupt others. Almost all free speech questions can be addressed with the answer of “More speech,” not “Less speech.”)

The school’s Dean, Jenny S. Martinez, published a response to the contretemps, siding with the idea that the answer is more speech, disrupting the presentation was wrong, and that this is a particularly important value at universities. She explained that lawyers need to be able to listen to and respond intelligently to the arguments of others. Blah blah blah.

Now, I agree with her. I agree with almost everything she wrote. I agree with the blah blah blah parts, too.

And yet, I think she left something very very important out. I think that she is right. I think that I am right. But there is another perspective — not mine, but one that I want to understand — that she should have addressed. She barely waved a hand at it, and really didn’t even do that.

Imagine this perspective, though it is not mine:

Imagine that you are not part of a traditional elite, not part of a group from which the powerful elite have traditionally come. Imagine that you have lived your life surrounded by popular depictions and assumptions that people who (perhaps superficially?) resemble you are lazy, dumb, criminal, alcoholic, uneducated and/or otherwise marginal, disempowered, and objectionable. Imagine that you feel deeply defensive that the culture and the elite do not view you as a truly worthy equal, that perhaps you should be tolerated but that you will never actually belong.

I don’t have those experiences. That is not my story. But it is the story of many students. 

I have no doubt that deep education requires intellectual — and perhaps emotional — risk-taking, requires openness and requires trust. Learning to think differently is difficult and even challenges one’s identity. It puts one’s old values at risk, requires one to look at the world through different lenses and thereby forces a different relationship to oneself and everything that one has ever known. That is a lot. 

So, that is a lot to ask of people who have a lifetime of experiences to tell them that the culture around them does not trust them, does not think well of folks like them. What ought an educational institution do to make their success at this endeavor of such deep learning more likely? What kinds of support should it supply? When should it introduce challenges? In what manner — and with what timing — should it start to remove those supports?

I have no doubt that educational institutions have a responsibility to educate, and not merely to demand proficiency. No doubt at all. That’s basically a tautological statement, that educational institutions should educate their students to instill in them their most important lessons.

And I agree with Dean Martinez that it is really important that lawyers be able to hear and respond thoughtfully to arguments that they believe are without merit and/or even made in bad faith. Listening to that garbage and not losing your shit? That, that’s a lawyer skill. It’s incredibly important. 

It appears to me that Stanford Law School’s answer is to throw their students into the deep end. Perhaps to announce that being able to swim is really important, and then throw their students into the deep end — all without any real effort to educate them on this important skill.

And without any consideration at all to how making students who were never taught to swim constantly nervous that they might be thrown into the deep pool…never considering at all what that might do to their necessary trust in their professors and the institution. 

Stanford Law should teach this. All law schools should teach this, if it is such an important lawyer skill. Merely demanding proficiency is not enough. A one-time lecture on this is not enough, be it in writing or some other form. A one-time workshop is not enough.

And Dean Martinez should be a lot more mindful of the difference between the behaviors and dispositions we expect of experts and the supports that educators need to provide to help our students to become experts. I agree with her goal. I agree with her views on freedom of speech on campus. But I think she sorely misunderstands the needs of her students and her responsibilities towards them.

The Most Dangerous Idea in Large-Scale Assessment?

Perhaps the most dangerous idea in large scale assessment is the idea that items assess standards, as opposed to assessing test takers. So much sloppiness and inappropriate inferences (and uses) of tests — which means they are not valid tests! – are caused by that enormous mistake.

If items directly assessed standards (or KSAs), then it would be possible to examine an item to see whether it is aligned to the standard without considering test takers. Item developers could just think about the ideal of an item and the ideal of an item that is aligned to a particular standard. They would not need to know, understand or think deeply about students.

But items actually assess students’ proficiencies. They are about test takers’ cognition — which is why they are called cognitive assessments. And test takers vary. They vary a lot. They vary in proficiency, in background, in experiences and in how they were taught. They vary in their command of other proficiencies. One might be a good reader and another a poor reader, making the word problem a very different challenge for two students with similar arithmetic ability.

We say that a valid item elicits evidence of the targeted cognition for the range of the typical test takers. We take that idea that test takers vary very seriously. There is a range of typical test takers for an item — a multi-dimensional range.

Different test takers might find a different entry point into an item. They might have a different first thought. They might have a different initial guess. They might have a different next cognitive step, after that initial guess. They might consciously apply different strategies, or be differently aware of how they are getting to their answers. Because they have had different teachers who used different explanations or different examples, two test takers can view the exact same item with different views of its novelty — and all that that implies about finding a path to a successful response.

Test takers vary.

But little in item development training or item development practices dives into how test takers vary. There is little — and usually no — documentation about the different ways in which a standard is taught or the different kinds of common mistakes and/or misunderstandings that potential test takers have with the standard.

Instead, we too often rely upon one adult’s view of what the most likely reaction and cognitive path towards a solution might be — too often done without thinking and without the appropriate humility that there are many other potential reactions and paths.

In spite of all of this, people expect individualized score reports and people make individual inferences about test taker capabilities based on a test that assumes that test takers all react and think the same — an assumption that is logically at odds with the range of standards and the idea that different kids will get different items wrong.

This idea that items assess standards is particularly ill-suited to be paired with the expectation that these tests can deliver useful information about individual test takers. Even without that expectation, the frequent mismatch between real live test takers and the assumed Aristotelian ideal of a test taker means that even the aggregate results rarely reflect reality well across the tested population.


The limits of parental authority in America


Of late, I’ve been confronted far more often by people claiming that parents should have vast amounts of authority over their children and their children’s education — in ways that not only disturb me, but that I think are actually un-American.

In this country, we use democracy and elections to make community decisions about values — often disputed values — and how those values should be enacted and enforced within our communities. Yeah, democracy is a really poor way to do that. Recall that Churchill said, “Democracy is the worst form of government — except for all the others.”

So, we do not use the New Testament to make these decisions. We do not go to oracles or prophets. There is no supreme leader or wise person who decides these things for us. We trust the wisdom of the crowd. We use elections, and our Constitution — the only founding document for our government — says quite clearly that authority to do this comes from The People, “We, the people.”

(I love the Declaration of Independence. I read it in full most years. It’s got one of my favorite lines ever, “He has plundered our seas, ravaged our Coasts, burnt our Towns, and destroyed the Lives of our people.” Man, that’s some good rhetoric! But that’s what it is, rhetoric. The Declaration might be a founding document for our country, but it is not a founding document for our government. There is only one founding document for our government, and that’s the US Constitution — which makes clear that the government’s moral authority stems from The People.)

So, America trusts the people, collectively, to make decisions about values. At the same time, however, one of the truly distinguishing characteristics about this country is how we value the individual and individualism. While “liberty” and “freedom” mean many many things, one thing that they mean in this country is that individuals should be free to engage in their own decision-making about their own private lives and property. We prize stand-out excellence and self-efficacy in this country.

There is a tension there. No doubt. Where do we draw the line between the zone of private individual freedom and the zone of the public that is controlled by the People (and their elected representatives)?

Well, that line has moved through the years, and not just in one direction. In some ways — in many ways — the private zone has expanded. But in some ways, it has shrunk. Well, not so much shrunk as been better delineated, in recognition that other people exist with equal moral and legal standing.

For example, at the nation’s founding, women had few rights. Very few rights. Women could not necessarily own property. When my parents were born, a married woman could not get her own credit card. Oh, wait….let me correct that…when I was born, a married woman could not get a credit card in her own name. Married women did not have bodily autonomy at all, really — their husbands had virtually unconstrained control over their wives’ bodies. But we now recognize that women are people, too. Women — including married women — have all the same legal rights as men. So, the zone of private control by men has shrunk, but the zone of private control by women has expanded.

I’m not going to go into slavery and Dred Scott, but you know…

This takes us to children. They clearly should not have full control over all private decision making; they are just children. We used to set the age of majority at 21, and only just reduced it to 18 for the Baby Boomers (and succeeding generations). Children have many many rights, but not real control over their lives.

That poses the question: Who should determine what is best for a child?

Now, very few parents are experts in child psychology, developmental psychology, nutrition, medicine, moral instruction, curriculum, etc. Even those that are, well…lawyers make the worst clients, doctors the worst patients and virtually every shrink’s kid is all kinds of messed up. When our emotions and sense of identity get more involved, we are often unlikely to make the best decisions. So, why should we trust that parents will always make the best decisions for kids?

Sure, we hope that most of the time they will make the best decisions for kids. They are usually more invested in those children than anyone else is. Most parents want what is best for the kids, at least most of the time, or at least they think they do. But all parents are fallible people. Some of them are not great people. Some of them are sometimes not great people. As a matter of convenience, we have to trust them most of the time.

But should we trust them all of the time? Why should we do that?

Are children more like some piece of private property which owners have huge amounts of control over? Are they more like beloved family members, like wives? Should we trust some adult (or two) to make all decisions about a child — like husbands used to be able to do for their wives? Or, should the community be the ultimate arbiter? Should we have a safety net for children, because we — as a community — value our children’s well-being that much?

Back in the day, people had to use animal abuse laws to protect children because the law did not conceive of the idea that children would have rights of protection against mistreatment by parents, just like the law did not protect women from their husbands. 

Honestly, I do not trust that parents should be the final and ultimate arbiters of what is good for their children — just as I do not trust husbands to be the final arbiters of what is good for their wives.

Rather, as an American, I trust We, The People. 

Recognizing What We Optimize For

Life is about tradeoffs. Work is about tradeoffs. Work-life balance is about tradeoffs.

Nothing is perfect in every possible way. Instead, tradeoffs grounded in external pressure, priorities and/or values aim towards some sort of acceptable balance of different factors and criteria. These tradeoffs really are about values.

There are many different values that we might bring to large scale assessment development, including

  • trust of educators.

  • information for parents.

  • feedback for students.

  • information for policy makers.

  • reliability.

  • validity.

  • testing/seat time.

  • cost of development.

  • operational costs.

  • scoring costs.

And no doubt many many more. 

Some of these potential values have clearly been given lower priority, and some higher priority. 

Now, The Standards for Educational and Psychological Testing say (in the second sentence of the very first chapter) that validity is “the most fundamental consideration in developing tests and evaluating tests.” I quote this line all the time. Validity is the alpha and omega of assessment — or it should be.

I am concerned — I have been concerned for a long long time — that validity is not given the priority that our bible says it should be. I am concerned that we optimize for reliability, at the expense of validity. I am concerned that we optimize for time, at the expense of validity. I am concerned that we optimize for cost, at the expense of validity.

Reliance on templated items and various forms of automation can deliver less expensive tests. But they will be worth less because of the increasing costs to validity. If the goal of these new efforts and technologies is to invest those cost savings and time savings back into validity, perhaps they can be worth it. But if the vision is to merely save time and money, they are just another drain on validity, when our tests are far from having the validity to spare.