Why I Love Rubrics

My oldest concern about education policy and practice is the meaning of grades. I started to wonder about standardized tests in middle school, but my questions about grading practices go back even further than that. Assessment is my longest running obsession.

I have had teachers who would lower a grade because students’ names were not in the correct corner of the page, or because the order of name, period and date was wrong. We know that handwriting quality can impact grades. And, as a new teacher, I learned that many of my colleagues kept grade books, but would simply eyeball them to assign a grade for the term, rather than do the simple work of actually averaging the recorded grades. (This was back in the 1990s, when grade books were all on paper.)

And I never understood why 50% had to be a failing grade, or how anyone could do such good work on a project or paper that it was absolutely perfect (i.e., 100%).

The arbitrary and inconsistent methods used to calculate grades — either for an individual assignment or for a marking period — baffled me, and at times infuriated me. To this day, I do not know what a B+ means. Is it mastery of the content but poor organizational skills? Is it mediocre performance on the content, but hard work, diligence, sweetness and all that extra credit?

Today, I can defend grades better than I could then, but I still have lots of problems with them.

But rubrics address most of my concerns, at least on the assignment level.

  • Rubrics lay out what is relevant to the grade, in their dimensions/traits.

  • Rubrics lay out what each level of performance should look like.

  • Rubrics can give advance notice to test takers of the criteria.

  • Rubrics help teachers to be more consistent across students.

  • Rubrics lay waste to the practice of 50% meaning a failing grade.

  • Rubrics can give students clear direction on where they need to improve their performance.

  • Rubrics can help teachers to ignore things that they should ignore.

  • Writing rubrics is a good exercise to help teachers think about the learning goals they have for students.

This is not to say that rubrics are perfect. Poor rubrics create enormous problems. Teachers who ignore their own rubrics when grading (or who use them improperly) undermine the whole idea of rubrics — and the trust that students should have in them.

But rubrics can be flexible. Rubrics can be tailored for individual assignments, or set up as a grading system to be used over time. Explaining rubrics to students can provide scaffolding for students to self-monitor their own progress and to think about where they want to focus their work and attention.

Moreover, rubrics are invaluable for standardized assessment. When standardized tests use constructed response items (e.g., essays, short answer, fill in the blank, show your work), scorers need guidance and structure to ensure that they are consistent through the day and are consistent with each other as they score responses. Even automated scoring of this kind of item is based on training samples generated by human scorers using rubrics.

In fact, we cannot use constructed response items on standardized tests without rubrics. If we did, the scoring could not be consistent, and that would violate the very definition of standardized. I have no doubt that improving the quality of our standardized tests to a truly acceptable level requires more constructed response items, which means that I want a lot more rubrics.

Understanding the Doctoral Literature Review

I help a lot of doctoral students to figure out how to write their dissertations. (That’s dissertation coaching.) I have found over and over again that they do not understand the purpose of the literature review.

Academic research can be quite different from other research that many people do. In much of our lives, we are looking for support for our position, trying to find the evidence and ideas that will help convince others of our position. But that is not what academic research is about, at all. (Or at least, it never should be.) At other times, we are trying to learn, for ourselves. What is out there? What is known? Or, what can I learn from doing this? Again, that is not academic research.

Instead, academic research is about building knowledge. Not one’s own personal knowledge, but rather the knowledge that we — as a field or discipline — have. Academic research is about contributing to that knowledge of the field. It is about real discovery of something new, or deeper or more specific examination of something we don’t quite have nailed down, yet.

This means that the researcher — in this case, the doctoral student — must know what has already been done. That is, they must investigate what has come before to make sure that they are not unintentionally reinventing the wheel. The fact that this researcher does not know what others have done is simply not an excuse. Instead, it is their responsibility to do the library research and investigations to find out what has come before.

I call this dynamic of building on and contributing to the literature scholarship. That is, there is a scholarly conversation going on through journal articles and academic conferences. When young or new researchers — like doctoral students — conduct and write up their research, they should take part in this conversation. The first way that works is to acknowledge what has already been said by others, so that the researcher can build upon it and respond to it. That acknowledgement is the literature review.

This allows researchers to stand on the shoulders of the giants that have come before them. This goes to the old aphorism, “If I have seen further than other men, it's because I have stood on the shoulders of giants.” Heck, Google Scholar — an invaluable tool for anyone conducting or writing a literature review — adopted the motto, “Stand on the Shoulders of Giants!”

So, the literature review tells the reader, this is what has come before, this is what we — as a field — already know, and this is what I am building on. This is the scholarly context for my research.

The doctoral dissertation’s literature review has an additional dimension, as well. The doctoral dissertation is a masterpiece, in the oldest sense of the word. That is, it is work done by an aspirant to be reviewed by the guild to determine whether they have the skills to be deemed a master, to be accepted as a fellow master. Therefore, the doctoral dissertation must do all the things, and do them better and more completely than typical work. Being subject to that kind of examination, the doctoral dissertation must make all the skills and knowledge clear.

This means that the doctoral dissertation’s literature review must be more complete and in-depth than the more typical literature reviews one might see in journal articles. Doctoral candidates must make clear to their committees that they know how to take part in the scholarly conversation, that they have the skills to find the scholarly conversation and understand it. They do this by showing off how well they have done and written about this one.

That’s a lot. In my view, it is the most intimidating part of the doctoral process. It is ok to be intimidated by it. But there are ways to make progress. Your program — or dissertation coach — should be able to get you started, help you to strategize and help you to reorient. Then, they should help you to figure out how to write it all up.

I will write more about some smaller issues with literature reviews next week, but there’s one more thing to say here: what the literature review is not.

Doctoral dissertations are not a student’s personal or professional philosophy. They are not a life plan. They are not everything that the student wants to say. They are a very completely and clearly executed piece of research. There are places in the dissertation for the students to opine, philosophize and even rant, but those are quite specific, and none of them are in Chapter II (i.e., the literature review). Chapter II is about what others have researched, written and reported.

Literature reviews do have room for the student/author, but it is really mostly implied. The organization of the review is full of implied judgments. What is mentioned first? How in depth is the explanation of this study or that article? What is reported about this project or that book?

Furthermore, no studies or articles should be excluded simply because the student/author does not like how they came out. The literature review should not be designed simply to support the author’s desired conclusions. They should not cherrypick the literature that they like. Not only is that unethical, it is actually counterproductive. If this project works out as the author expects, it can be more powerful for offering evidence against prevailing ideas in the literature. If there are divisions in the literature, that only serves to justify the need for this study.

And so, doctoral dissertation literature reviews should play it straight, without being slanted to support desired conclusions. They are not about convincing anyone of a result, but rather just about informing the reader of what is known and showing that the doctoral candidate has truly investigated the state of the scholarly conversation on the topic.

RTD Theory of the Item (Draft)

As RTD is an item-focused approach to test development, we had to be sure that we knew what items are and how they function on tests with test takers. We needed to develop RTD because we could not find anything that did this for content development professionals.

Psychometrics does not examine items or look inside of them. Rather, psychometrics treats each item as a black box that produces some small amount of data about each test taker. It takes all that data and analyzes it in a number of sophisticated ways to examine the relationships between items and what the patterns in the data indicate about each test taker. Because psychometrics does not offer tools to examine the contents of items — which is where the cognitive content, construct and KSAs (knowledge, skills and abilities) are found — its view of items has almost nothing to offer about test validity. That is, psychometrics has almost nothing to offer about whether tests actually assess what they are purported to assess — neither on the item level nor the test level.

Obviously, for those who care about validity and item-content domain alignment, that is a huge problem.

The public has a very different view of items than psychometrics offers. It usually takes for granted that tests and items measure what they are purported to measure. It may simultaneously accept that there is some unfairness in standardized tests, but it is generally thought to be fairly minor. The public’s biggest objection to tests appears (to us) to be that some people are just bad test takers — not that tests or items themselves are flawed. It has stronger objections to big standardized tests than to other tests, but the origin and basis for those objections are unclear. At times, we see claims that these tests are racist or classist, but usually without explanation of how that is. That is, there are objections about test scores without explanations of how racism or classism infected them.

Again, a view that does not offer anything useful to improving tests, validity or items.

We had no doubt that items matter, and that there are differences in item quality. That there are good items and bad items. That some items do their jobs better than others. We figured that their job was to examine test takers for particular KSAs (or targeted cognition), and had seen too many items that we believed missed their marks. But we lacked a theory or framework for explaining that. We certainly lacked a framework that could be used to support item development and that could connect the various ideas, principles and practices that contribute to item development.

This led us to think carefully, for years, about what items are. We knew we needed to explain the relationship between an item’s goals and ideal functioning and what actually happens when they go wrong. We knew that we needed to explain how item developers connect targeted cognition and test takers. We knew that we needed to explain how test takers respond to items.

Eventually, we developed the RTD Theory of the Item (TotI). The figure below offers an illustration, but the explanation in our book (in progress), Rigorous Test Development: A Practical and Conceptual Guide for Content Development Professionals, is where you will find the really good stuff.

You can download and read a preview of our Theory of the Item chapter – still a draft, but one we feel actually explains what items really are.

Rigorous Test Development (RTD) Theory of the Item

RTD: The Book

We are finally working on the book, Rigorous Test Development: A Practical and Conceptual Guide for Content Development Professionals.

This book is intended for the people who develop standardized tests, on the item level. There is a startling lack of training, certification or professionalized/disciplinary knowledge available for them. There are virtually no textbooks, journals, courses, degree programs or professional development programs. There is nothing that explains how they fit into the larger test development process, that gives them a conceptual view of their work or tries to treat them like intelligent and dedicated professionals who want to learn and grow.

This stands in contrast to psychometricians, who can get master’s degrees and doctorates, and who have countless conferences, journals and online courses available to them.

As in so many areas that we care about, we are worried about the quality of the work, and want very much to help it to improve. We know how the sausage is made, and we want better processes for making it so that better sausage is available to everyone.

Yeah, test items are the sausage, in this metaphor. But you knew that, right?

We know that low quality items can only produce low quality tests. You simply cannot make chicken salad out of chicken scat. We care enormously about item quality because we know that standardized tests matter and are not going away.

Which (new school) begs the question: what even is item quality? Well, that’s in the book! The industry doesn’t really have a good definition of item quality — though we do. No one offers ideas about how to think about item quality or how to improve it. All of that is in the book, and we will write about it a little bit here in the weeks and months ahead.

The thing that really makes RTD different is that our lens for judging item quality is the interactions of test takers and content. It is a content-focused view. We do not think that item quality can be judged without looking at the contents of the item, and the KSAs (knowledge, skill and/or ability) from the target domain that the item is targeting. Statistical analyses based on data from field or operational testing can help, but that data comes far too late, far too expensively and never gives any insight as to why an item is performing well or poorly. We offer frameworks for thinking about that.

So, we are seriously working on the book. It is coming. In the weeks and months ahead, we will offer previews of the thinking behind various chapters, and next week we will even offer a preview of perhaps the most important chapter, The RTD Theory of the Item.

Object Lessons and North Paulding High School

I have spent the last week thinking about North Paulding High School. This is the Georgia high school that temporarily suspended a student for posting a photo of the crowded hallway there, in which almost no students were wearing masks. This is the high school whose leadership said that mask wearing was a personal choice and they could not effectively enforce a rule on it, despite having a basic dress code.

The first reason I keep thinking about North Paulding High School is that I have been there. I have sat in the principal’s office. When I began in policy analysis and program evaluation, this school district was one of my first clients. I grew to know the long-term superintendent well, and became a huge fan of one of the district’s principals.

Of course, that was ages ago.

So, I was quite surprised to see Paulding in the news.

Despite my old relationship with this school district, I was immediately highly critical of their decisions, their excuses and their actions. I am on the “school dress codes are usually sexist and hold girls responsible for boys’ immaturity” bandwagon, and my issues with attempts to control teenagers — as opposed to influence or teach them — go back much further. Of course a school can enforce a mask mandate! Not a lot of sympathy from me, right?

Except…I thought that North Paulding was getting a raw deal out of the coverage.

Sure, they deserved all the criticism they got. No doubt of that! But they didn’t deserve all of the criticism. I did not think that this was the only high school in the country doing this. I did not think that it was the only high school in Georgia doing this. Heck, I did not even think that it was the only high school in Paulding County doing this.

Were there others trying to suspend students for bringing light to a situation? Well, I have seen that before. We have all seen school leaders claim that criticism from students — including embarrassing valid criticism — is “disruptive” and therefore may be barred, under the law. We have all seen basic disrespect for student rights. We have seen students try to stand up for their First Amendment rights and get intimidated by the powerful into stepping back.

And I do not believe for a second that every other school in Georgia both has a mask mandate and is enforcing it. I do not believe that every other school in Georgia has figured out how high school students can safely pass from one class to another, through the day.

And yet…I believe in object lessons. This might not have been the best one, but there it was.

Sometimes, we are faced with something that did not work out. A minor snafu. A larger fuckup. A huge clusterfuck. Something. Regardless of what it is, I believe in learning from it. Really learning from it.

Those failures, regardless of their scale, are authentic. They occurred in real contexts, in which real people were doing what they actually do. If the screw up(s) happened, they could well happen again. Whatever led to the screw up(s)…well, if it is not addressed, why wouldn’t they happen again?

I do not agree with anything that happened at North Paulding High School. And I do not think that coverage was fair. But I think we can all learn from it. This can be an object lesson. We can all think about how and why it happened, but not as some hypothetical example written by some trainer or consultant. No, it really happened. We know what happened. It can be an object lesson.

Your school district might not get exactly to that point, but I’ll bet that it has some of the same issues that led to North Paulding’s embarrassment. I am sure that mine does. My own organization can suffer from some of those issues, too.

We can look at this one example and reflect on what it might tell us about our own weaknesses and vulnerabilities. It can help us to think about where we might make similar mistakes.

So long as we are willing to learn. So long as we are always looking to learn. The object lessons are available all around us.

Understanding the COVID Challenge for School and District Leaders

There has never been a time in my professional career — or in my life — when I have less wanted to be a school or district administrator.

You see, there is no good answer. There is no good policy. There are no good choices.

Though I have friends and colleagues and clients who work leading schools and school districts, I know that I can barely imagine how hard this time is for them, as they try to figure out what to do in their 2020-2021 school year.

Despite what so many preach and yell about, schools and school districts are very light on administrators and support staff. Relative to other industries, the work force is very concentrated in the line workers, the ones who do the core work of the organizations. That is, school districts are full of teachers and teacher aides. There are some school nurses and some counselors. Those are the people who work with students. But supervisors and support staff? Other kinds of expertise? No, schools and districts are very light, in that regard.

Think about how many teachers an average school principal is responsible for supervising and evaluating. Even if you split that up across the principal and the handful of assistant principals, the number of direct reports for administrators in schools is far larger than what we see in virtually any other business.

I have preached for 25 years that if your team does not have slack capacity in it on a regular basis, then your team will not have the capacity it needs when a crisis hits. You can be efficient along the way and then suffer when the crisis hits, or you can be resilient when the crisis hits. You can be more efficient across time, without the major setback of the crisis, if you are willing to give up some of that presumed efficiency along the way.

But we do not run our schools that way. Our schools are crowded, our teachers are under-supervised and under-supported and our school leaders are barely supervised or supported, at all. There are well over 12,000 public school districts in this country and fewer than 1,000 have more than 10,000 students. To get a sense of that scale, that’s a district with more than two high schools. All the others are tiny, and are led by tiny district offices.

Where are these district offices going to find the capacity to reinvent schooling over a summer? They can’t.

But the larger districts really can’t either, because there are no good options.

  • Distance learning is inferior to in-person learning, even simply judging on the traditional content in the explicit curriculum. While there are claims about efficiencies in online learning, no one really claims that it is better for individual students and there are no credible studies that show that it is.

  • Distance learning widens inequality gaps. Everything that has contributed to those gaps over time is made worse in the COVID era.

  • School facilities — the buildings themselves — have suffered deferred maintenance for decades. HVAC systems are old and creaky. Windows may or may not open. They are not suited for a pandemic respiratory virus.

  • Schools are generally crowded. Not all are that over-crowded, but they are not sparsely filled. We simply do not have the space to spread students out. When there is declining enrollment, we shut down buildings or shift things around, so that we don’t have to fix up the ones we have.

  • State departments of education have always been short-staffed and under-resourced. They do not have any expertise in the areas that school and district leaders need help with.

  • One of our national parties has been against the US Department of Education ever since President Carter raised it to be a cabinet level post. Their presidential candidates and nominees have run on reducing or even eliminating USDOE. Other federal departments get to focus on emergency preparedness and disaster planning, but not USDOE. It lacks the resources to do that kind of work, too.

  • And there is no additional funding. States and municipalities are short on tax revenue and education is the largest line in the budget. Money will be cut, not added. The federal government is doing nothing to support schools and districts in the COVID crisis, and we all know which party is keeping that from happening.

The thing is, other parts of our country cannot really come up with good solutions, either.

  • We know too little about the virus. Science takes time. Knowledge evolves and grows as we get more opportunities to learn. We just don’t know.

  • Too many people want easy and simple answers, binaries that make decision-making easier. Children can’t get coronavirus, they want to think. Or, if they do, they cannot transmit it. They don’t get sick. But all of those are untrue.

  • We still lack the testing capacity in this country promised to us many months ago. We simply lack information about the state of the pandemic today, and individuals cannot get timely test results, even in the limited testing we do do.

  • A sizable number of people refuse to take the most basic precautions to prevent spread of this disease. Among those who do wear masks, a ridiculous number lower their masks when they talk — which is the worst time to lower them. People wear masks without covering their noses.

  • Almost no one is acknowledging the long recovery period for many who get sick and even fewer acknowledge that there are long term health consequences, even perhaps cognitive effects.

If you had to make decisions for your organization in this context, what would you do? Could you come up with a plan for your people and customers that was safe?

But schools and districts have it worse.

  • Small children squirm and move around. How can you keep them apart?

  • Teenagers can know the right thing to do, and yet still have trouble actually doing the right thing. That’s just where their cognitive development is. That’s just how their brains work, at their age.

  • Teachers almost always love their work and love their students, but they also love their own families and need to worry about their own health.

  • There has never been a ready supply of capable teachers to replace or augment the ones we have. We cannot find average teachers to replace the ones who are too vulnerable to work in schools and we cannot find appropriate extra staff to allow classes to be significantly smaller.

  • Teachers already work longer hours than most people understand, and asking them to significantly increase their workload by teaching both in-person and online is simply not possible. There are not enough hours in a day.

  • Teaching online is a different skill than teaching in person.

There simply is no good answer. There’s no mediocre answer. Every school and district leader is trying to choose from among a bunch of really bad answers. They are thinking about children getting sick, their schools being vectors for spreading a pandemic through their communities and about their staff members dying.

We venerate fire departments and police departments who do not work nearly as hard as teachers do because when the shit gets real, they put their lives on the line for the rest of us. Cops love to talk about how much potential danger they face every day on the job. Well, we are asking teachers to face far more danger than cops ever do, and to do so every day.

In 2019, fewer than 100 officers were shot and killed in the line of duty, and we venerate them all for it. Teachers constitute approximately 1% of our nation’s population — and that does not count all the other school personnel. 1% of the deaths we have had so far in this COVID era is well over 1,000 people in just six months. Fewer than 50 cops were shot and killed in the line of duty in 2019, and we will see more than 200,000 Americans die this year in this COVID crisis.

We are asking school leaders and district leaders to come up with answers that will meet the needs of children and communities while bearing a moral and emotional burden to keep those children safe and to do right by their own people, the teachers and other personnel who work for them.

There are no good answers. There are no mediocre answers. There are only horrible answers.

Radical Empathy

Radical Empathy is a term that we have been using to describe the most demanding and difficult step of our rigorous Item Alignment Examination procedure. Others use it in different ways in different contexts, but it is an important term for us because it is about a very important tool in the CDP (Content Development Professional) toolbox.

Radical Empathy is an open-minded iterative process of trying to think through an item (i.e., all the way from the beginning of the stimulus through completing the work product) from as many relevant perspectives as you can. It stands in contrast to consciously working through an item as yourself, or as you remember yourself to have been at an appropriate age.

Empathy is, of course, at the center of Radical Empathy. Reading through an item with empathy is to try to read through an item through the lens, in the shoes and wearing the hat of another person — possibly a person quite different from oneself. Obviously, the more different the person from you, the more challenging this act of empathy is.

What makes Radical Empathy so radical is its iterative and open-minded stance. The point is not merely to capture another perspective. Rather, it is to try your best to imagine the range of perspectives that one might find across the range of typical test takers. You need to take each of those perspectives and work through the item, carefully and consciously, wearing that lens, shoes and hat.

Thus, in addition to thinking from the perspective of a young person — usually from a whole different generation — you must consider different educational backgrounds, different experiences of the world, different knowledge and skills, different amounts of patience and focus, in addition to the more traditional demographic issues of gender, race/ethnicity, English language status, urbanicity, etc.

Radical empathy is about capturing as many of these perspectives as possible and carefully thinking through how these differences in perspective might influence the cognitive path that different test takers take as they work through items. However, any individual’s ability to imagine others’ perspectives is necessarily limited by their experiences with other people. Thus, efforts at Radical Empathy — such as in Step IV of Item Alignment Examination — call on content development professionals to pay attention to opportunities to learn more about others’ perspectives, to be open minded about the existence of perspectives they had not previously considered and to be humble about their confidence in their understanding of the perspectives they had considered.

We think that this is quite difficult. We think that every part of this is difficult — from staying in just a single other person’s perspective all the way through an item to maintaining that humble stance. However, because valid items elicit evidence of the targeted cognition for the range of typical test takers, we believe that this is necessary.

The Inefficiency of Education Spending

One of the great criticisms (i.e., oft voiced) of our education system is that spending (per student) has shot up over the past 30-40 years, but test scores have barely budged. While the degree of the spending increase is usually overstated, and test scores have grown more than critics give credit for*, the criticism actually is true. 

(*Obviously, there is no clear standard for how much scores should increase, but the real issue there is the cherry picking of one test that has shown the least increase. This exaggerates the lack of increase in test scores, as not adjusting spending for inflation exaggerates spending increases.)

However, this criticism entirely misses the mark. In fact, the moral mission of education requires our system to get less efficient over time. 

Though there have been occasional technological break throughs that have greatly increased the efficiency of education (e.g., books), education has always been a very labor-intensive task. While technology can lower the cost of information dissemination, that has long been the easiest element of schooling. So much of the work of schools entails diagnosing errors and holes in students' thinking through examination of their work and what they say, leading to individualized  explanations and scaffolding to support their learning. Of course, monitoring and maintaining student engagement and motivation is similarly individualized. And this says nothing of the basic daily childcare role our schools. 

There's no reason to expect efficiency gains in these areas, despite our ever-improving technology. But that does not explain increasing costs. Increasing teachers' wages amid a growing economy and rising standard of living is, of course, a portion of it. But there is something much deeper pushing against increasing efficiency in schooling.

Our schools -- everyone's schools -- historically have not attended to all students equally. We have long tolerated or even encouraged some kinds of students to drop out. Heck, back in the day we did nothing to encourage them to enroll in the first place.

Who have we focused on? Children best prepared to learn. The smartest students. Students from the most stable and supportive families and communities. Students most readily able to learn, who need the least support from their schools. The cheapest students to educate.

But for most educators, the moral mission of our schools is to educate all students. All.

That means that we need to put greater efforts into reaching students who need more support from their schools, students who are more expensive to educate. These can include students with physical, emotional, mental and/or learning disabilities. These can include students whose parents lack even high school degrees. Students from families less able to support their children's educational pursuits. Students from families whose social and economic conditions are more likely to impede than to support school work and learning.

In the last two decades -- under Presidents George W. Bush and Barack Obama -- there has been incredible federal pressure on schools to pay attention to lower-performing students. Schools have been judged by their success at bringing up those students, rather than being able to claim success merely by focusing on students already doing well in school.

The moral mission of our public schools requires them to target the more difficult (and expensive) to educate students. Success in our public schools comes when we make greater efforts to reach these students, to close achievement gaps, and bring up every student to the level they need to be successful in their further education and/or work, after high school. 

Of course we want to get better at reaching all students. Of course we would like to find ways to reach all students that are less resource intensive -- in large part because as we free up resources from some students, we can do a better job of reaching other students. But we have so far to go in the mission of reaching all students that any efficiency gains will be -- and morally ought to be -- used to do still better with students whose performance we judge lacking.

Unlike so much of the world -- unlike the world of commerce and profit motives -- the moral mission of our public schools requires educators to seek out what others might call the worst customers, or the least profitable customers. Charter and other private schools can target their marketing and counsel out students they find too challenging to educate, but public school success is often defined as doing well with precisely those students.

Complaints about long term declines in school efficiency are not only misguided, but actually immoral.


Perhaps the Biggest Problem: Misunderstanding Bias and Error

"Bias" has a particular meaning in the field of measurement. Fortunately, this meaning is not that far off from our colloquial/everyday use of the term. In measurement, it means systemic error in a particular direction. This meaning highlights the fact that there is another class of error, the kind that is not systemic in a particular direction, "noise."

Unfortunately, too many people -- including too many psychometricians and other professionals in the field of measurement -- do not really recognize bias, and therefore our use of measurement suffers for it. 


Noise is endemic in any measurement. Measurements are always a little off -- maybe a little high, maybe a little low. I need 2 tablespoons of sugar, and maybe I grabbed 26 grams, rather than 25. Or maybe it was 24. Or 24.2. Or even 22.8.

The more careful we are and the more precise our instruments, the better we can do at reducing noise. But we cannot eliminate it. Our measurements will always have some random error component.

In cooking, if we are careful and actually use the right tools, the noise is too small to matter. 

In other applications, the noise can matter more. In educational measurement, where we make important decisions for and about children, the noise can be very important. We know this, and we have statistical tools to help us recognize it, to help us quantify it, and to help us think about how to reduce it.

The primary way to deal with noise is to make longer tests. Seriously. And this works. Because noise is -- by definition -- random, in the long run it will cancel out. Tests with more items (i.e., questions) actually lead to less noise in the final score. In this case, more is better. In this case, adding more noise leads to less overall noise, because random error can cancel out.
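The claim that longer tests are less noisy can be checked with a small simulation. This is only a sketch under simple assumptions: each item gives a reading of the test taker's true ability plus independent random error, and the score is the average of those readings. The ability value, the noise level and the function names are all made up for illustration.

```python
import random
import statistics

def simulated_score(true_ability, n_items, noise_sd, rng):
    """Average of n_items readings, each equal to true ability plus random noise."""
    readings = [true_ability + rng.gauss(0, noise_sd) for _ in range(n_items)]
    return sum(readings) / n_items

def score_spread(n_items, trials=2000, true_ability=0.7, noise_sd=0.3, seed=1):
    """Standard deviation of the final score across many simulated administrations."""
    rng = random.Random(seed)
    scores = [simulated_score(true_ability, n_items, noise_sd, rng)
              for _ in range(trials)]
    return statistics.pstdev(scores)

# Hypothetical test lengths: a 10-item test versus a 40-item test.
short_test = score_spread(n_items=10)
long_test = score_spread(n_items=40)

print(f"noise in a 10-item score: {short_test:.3f}")
print(f"noise in a 40-item score: {long_test:.3f}")
```

Under these assumptions, quadrupling the number of items cuts the noise in the final score roughly in half, because independent random errors partly cancel when averaged.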

This is not how bias works. 


At the annual NCME (National Council on Measurement in Education) meeting in Washington, DC this month, I had a whole bunch of interesting conversations with other people. These usually happened immediately after a session ended, as I spoke with someone I'd never known before about something we heard from one of the presenters.

Prof. Mark Reckase gave a presentation that focused generally on the differences between educational measurement and psychological measurement. In this speech, he mentioned the Hippocratic Oath, the idea that doctors pledge to, "First, do no harm." It would be wonderful if we had that professional ethical standard in educational measurement. After the session, I spoke with another attendee about this. I was saying that if we were to live by that standard, sometimes we just wouldn't test, we wouldn't use a measurement, we might not sell/license a test to some customers. But he didn't understand.

I tried to give an example, speaking of the problem of gender and racial/ethnic bias in job interviews. Unfortunately, our candidate screening processes tend to perpetuate the makeup of our companies. People are more likely to see their own positive traits in other people who look like them and who have similar backgrounds. It just is harder for people to see positive traits in people who look different and come from different backgrounds than in those who are already similar to them. Thus, an argument could be made that in-person interviews might do more damage than help, and if we lived by the "first, do no harm" standard, perhaps we should just skip them entirely -- even though they are a well established practice. That is, the fact that we have always done them might run right up against the "first, do no harm" standard.

This other gentleman insisted -- over and over again -- that the answer was to do more interviews. That if there was error in the interview process that doing more interviews would compensate for it. 

He was confusing random error/noise with bias. He had in his head that bias is just a form of error, and the answer is to let the error cancel out.

But bias does not work this way. 


Imagine that you have a measuring cup that is off. Imagine that it is just too small, by 10%.

Each time you use it, there will be a random component to the error. You won't get exactly 0.9 cups every time. You will get a little more or a little less than 0.9 cups. The random error/noise will be in addition to the bias. Therefore, carefully measuring out 32 "cups" to get 2 gallons will lead to a really good chance that the noise has cancelled out (for the most part), but you'll still be around 10% short. 
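The measuring cup example can be sketched as a quick simulation. The 10% shortfall and the 32 pours come from the example above. The size of the random error on each pour and the function name are assumptions made only for illustration.

```python
import random

def pour_cups(n_cups, bias=0.9, noise_sd=0.03, seed=7):
    """Total volume, in true cups, from n_cups pours with a cup that is 10% too small.

    Each pour delivers the biased amount (0.9 cups) plus a small random error.
    """
    rng = random.Random(seed)
    return sum(bias + rng.gauss(0, noise_sd) for _ in range(n_cups))

total = pour_cups(32)        # we wanted 32 cups, i.e. 2 gallons
shortfall = 1 - total / 32   # fraction still missing after all 32 pours

print(f"poured {total:.2f} cups, short by {shortfall:.1%}")
```

The random part of the error mostly washes out over 32 pours, but the total still lands about 10% short. More pours make the estimate of the shortfall more precise. They do not make the shortfall smaller.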

If a bunch of items are individually biased a little bit against girls, then using 32 of them won't fix that problem. It will produce a score that is just as biased against girls.

The answer that the measurement industry appears to use is to add a bunch of items that are biased against boys to the item pool. The thinking seems to be that these biases will cancel out. They want to turn bias into noise, and think that they can make it cancel out. 

And they do the same for other forms of bias, too. They think that they can just make the bias cancel out.

Unfortunately, this doesn't work for individual test takers. Even if the strategy were sound, it is not applied for individual test takers. A balanced item pool is one thing, but test takers don't take item pools. Test takers take individual forms, and I do not know anyone who examines individual test forms (or adaptively generated forms) for gender bias, or urbanicity bias, or racial/ethnic bias, or SES bias, or any other bias.
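To see why a balanced pool does not guarantee balanced forms, here is a toy sketch. Everything in it is hypothetical: a pool of 100 items, half tilted one unit against girls (+1) and half tilted one unit against boys (-1), with 20-item forms drawn at random. Real item bias is not this tidy, and real forms are not assembled purely at random.

```python
import random

# Hypothetical pool: +1 means an item tilts against girls, -1 against boys.
pool = [+1] * 50 + [-1] * 50

def form_bias(pool, form_length, rng):
    """Net tilt of one randomly assembled form drawn from the pool."""
    return sum(rng.sample(pool, form_length))

rng = random.Random(3)
form_biases = [form_bias(pool, form_length=20, rng=rng) for _ in range(200)]

pool_bias = sum(pool)
unbalanced_forms = sum(1 for b in form_biases if b != 0)

print(f"net bias of the whole pool: {pool_bias}")
print(f"forms with a nonzero net tilt: {unbalanced_forms} of {len(form_biases)}")
```

The pool as a whole nets out to zero, yet most of the individual forms in this toy setup tilt one way or the other. Even on the industry's own logic, balance would have to be checked form by form.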

Validity problems cannot be turned into noise. Construct underrepresentation -- a huge problem in educational measurement -- cannot be turned into noise. Dumbing down of content for the sake of our testing technology cannot be turned into noise. Lowering the cognitive complexity of items to accommodate time limits on our tests cannot be turned into noise.

These are all biases in our tests. But too often we forget that not all error is noise.


The Right to an Education

What does it mean to have a right to an education? I've been thinking about this, in light of talk this month about the right to free speech.


The First Amendment to the Constitution is the source of our right to free speech. But clearly this is widely misunderstood. Our speech is generally free from government interference, but even that freedom is limited. Though defamation, slander and libel are difficult to prove in court, the government does limit these. There are many limits on commercial speech, as well. On the other hand, our political, artistic, literary and scientific speech are almost entirely free from government interference.

However, the government does not protect our speech from others. We can be fired for what we say. We can be dumped. We can lose friends. We can lose customers. Our freedom of speech is really just about freedom from government interference or regulation of speech. 

On the other hand, there are other freedoms that the government does protect. The government will take action against those who discriminate, for example. We cannot be denied a room at a hotel or a meal at a restaurant for the color of our skin or our religion. The federal government will protect that right. We have a right to the same educational opportunities, regardless of our sex. The federal government will protect that right (provided there is any link to federal funding).

Many of our rights are to be free from the government. Some of our rights are ensured by the federal government against others. But what about education?


Unfortunately, we do not have a right to an education -- at least not on the federal or national level. The word education does not appear in the Constitution, and any federal role in education is questionable. Some argue that it is tied in with interstate commerce, and therefore a federal issue. But generally, our acceptance of a federal role in education has nothing to do with the Constitution. It has developed and grown over time, and we generally accept it -- even though some argue against it.

That leaves our right to an education as a state matter. In fact, the right to an education varies from state to state. The feds have addressed special education and both racial and sex discrimination in education, but other than that it is up to individual states. 

How much education do we have a right to? What quality of education? How much geographical equality? Equality of opportunity or equality of effort? How do we define effort or opportunity for these purposes? All of these are state matters. All of them. 

Our right to an education comes from state constitutions, state laws and state regulations, as implemented by state and local government and as interpreted by state courts. Not federal. 

Moreover, there is very, very little regulation of traditional private schools. The federal government has ruled that we cannot be compelled to attend a public school if we would rather attend (and pay for) a private school. If private schools take any federal funding, some regulations apply (e.g., the sex discrimination stuff). Otherwise, private schools are regulated like any other business (either for-profit or non-profit). There is a bit more regulation of charter schools, but this might just be a product of how much government funding they take.

So, the government is not going to step in and assure your right to an education. No one is going to assure you that your private school is actually providing an education to its students. The federal government tried to force states to step in with public schools to assure that they are providing a decent education, but is now pulling back. (Pulling back from an effort that did a piss poor job of defining or measuring an education, anyway.)

While most (all?) states do promise a right to a free public education, the quality of those schools is far from assured. In fact, it appears that while we may have a right to an education, we do not have a right to a decent education.

What is the Point of Teacher Evaluation? Seriously, What is the Purpose?

Recently, Rick Hess has written about the pointlessness of new teacher evaluation systems, claiming – in his headline – that they "Don't Make a Difference." But I think Rick might be missing the point. (He is basing this on the research of Matt Kraft and Allison Gilmour, which I have downloaded but not yet read. They might be missing the point, too, but I can't be sure yet.)

The point. What is the point or the purpose of teacher evaluation? Well, I can think of a few possibilities, but the bottom line of each and every one of them would have to be improving outcomes for children. How might teacher evaluation systems do that?

  1. Identify struggling teachers for removal.
  2. Identify struggling teachers for targeted intervention to improve their effectiveness.
  3. Intimidate struggling teachers to remove themselves. 
  4. Provide a structure or framework for struggling teachers and those who support them to think about teaching, so that they can better improve their effectiveness.
  5. Provide a structure or framework for all teachers and those who support them to think about teaching, so that they can better improve their effectiveness.

Those are five basic mechanisms by which a formal teacher evaluation system may improve aggregate teacher effectiveness, and thereby improve outcomes for children.

But I think there are more than that, because mechanisms #1 and #2 are ambiguous about who must know the identity of the struggling teachers. If it is just the teachers themselves, then mechanisms #1 and #2 are the equivalent of #3 and #4, but there remain multiple possibilities. Perhaps struggling teachers should be identified to their local supervisors? Perhaps they should be identified to their local peers? Perhaps they should be identified to their district offices? Or to the public, or to the state or the feds?

Each of those implies a somewhat different mechanism. Peers might support a struggling teacher in other ways than a supervisor, and systems by which peers decide on the removal of ineffective teachers – usually called Peer Assistance and Review (PAR) – do exist in some places. Certainly, targeted assistance and termination procedures are quite different if they are based on local supervisors rather than the district office – and likely different again if based on state officials or feds.

So, it's pretty complicated.

But here's where I think Rick (and likely Matt and Allison) is making his big mistake: many of these mechanisms do not require accurate reporting of teacher effectiveness. Many of these mechanisms are not undermined by fudging the public or official recording to the advantage of the struggling teacher.

So long as a teacher, his/her supervisor and/or his/her peers know that this teacher is struggling, the mechanisms based on improving his/her effectiveness can still work.

So long as a teacher knows that s/he is struggling, s/he can still leave. So long as a supervisor knows that a teacher is struggling, s/he can still pressure the teacher to leave. So long as a supervisor knows that an untenured teacher is struggling, s/he can fire that teacher without citing ineffectiveness as the reason.

Let me say this again: Teacher evaluation programs do not have to accurately record which teachers are struggling/ineffective to improve aggregate teacher performance and/or outcomes for children. They do not.

But what does require accurate recording of teacher ineffectiveness?

  1. State or federal intervention in handling struggling teachers.
  2. Humiliation of struggling teachers by public shaming.

Now, Rick doesn't believe that the feds can effectively intervene in this kind of delicate problem, and his logic there applies to most states, as well. So, where does that leave us? Either one of the goals of teacher evaluation systems is the humiliation of teachers (individually or collectively), or Rick is simply wrong that crazy high reporting of effective teachers (95%+) is a sign that the systems are not working.

I think that Rick is simply wrong. 


This actually takes us to a common problem with our education policy. For a variety of reasons – some better than others – we want an unprecedented amount of transparency in our efforts to improve schools. I don't know of any other field that calls for the public to know how individual workers are evaluated -- either individually or in the aggregate. Nor, in other fields, is franchise or branch office performance made public.

Sure, we all know how a sports team did each game, but no one expects the internal ratings of each member's performance to be reported to the public. Politicians do not release the performance evaluations of their staffs, neither individually nor in the aggregate. Researchers do not publicly release their evaluations of their students or their teams, neither individually nor in the aggregate. Think tanks do not release evaluations of their members, contributors or staffs.

So, why is it that we need to know how many teachers were deemed effective? It is not because we cannot improve outcomes for students without releasing these numbers publicly. I am not happy with the only reason I can think of.





Our Electoral Primary Process as a Measurement Problem

There is a lot to complain about with regards to our electoral primary process. My wife's favorite complaint -- other than the weird undemocratic nature of caucuses and the unfairness of the same two states going first every cycle -- is that our votes "never matter." We've never lived in an early primary state, and rarely even voted as early as Super Tuesday.

Does this mean that our votes have not mattered? Does it mean that our votes have mattered less than those of Super Tuesday primary voters? Well, if we think of the primary process as a measurement problem, I think that the answer is, "No." In this post, let me lay out what that means. Next time, I will explore this view for lessons about measurement in education.

The Construct Being Measured

The goal of the primary process is to select a candidate for the party's nomination for the presidency. The goal is not to find out who is the third most popular candidate. The goal is not to figure out who has the best chance of winning in the general election -- though perhaps it should be. The goal is to figure out who the party's supporters (i.e., voters) want to be the party's nominee.

Measurement Challenges

Every (interesting) measurement problem poses its own set of challenges. I see three major challenges with this one.

1) Voters and potential voters may lack information about the candidates.

2) Candidates may lack the resources they need to inform voters.

3) Voters and potential voters are not distributed homogeneously around the country.

The Key Assumption

There remains a key question that we must answer, because what we assume to be the answer will inform our solution and how satisfied we are with it.

Do the primaries reveal a relatively stable preference of the group, or do the primaries themselves shape and influence the development of a shifting preference?

Nate Silver's original work in 2008 on the Democratic primaries was based on a single insight -- one that I and others noticed as well. While Clinton and Obama's overall share of the votes varied considerably from state to state, their support within demographic groups was remarkably stable from state to state. Thus, given just a small set of results, he was able to extrapolate future results quite accurately, just based on the demographic profiles of the states.

This strongly suggests that the underlying trait (i.e., the preference of the group) is stable, and the fluctuations are essentially due to differences in the composition of each sample (i.e., the demographics of each state). 
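The arithmetic behind that kind of extrapolation can be sketched in a few lines. This is not Silver's actual model, just an illustration of the idea. The group names, support levels and state compositions below are all made up.

```python
# Hypothetical, stable support for candidate A within each demographic group.
support = {"group_x": 0.70, "group_y": 0.40}

# Hypothetical demographic makeup of the electorate in two states.
states = {
    "state_1": {"group_x": 0.25, "group_y": 0.75},
    "state_2": {"group_x": 0.75, "group_y": 0.25},
}

def predicted_share(composition, support):
    """Statewide share = sum over groups of (group's share of voters) x (support in group)."""
    return sum(composition[group] * support[group] for group in composition)

predictions = {name: predicted_share(mix, support) for name, mix in states.items()}

for name, share in predictions.items():
    print(f"{name}: candidate A predicted at {share:.1%}")
```

In this sketch the candidate's support within each group never changes, yet the predicted statewide result swings from a loss (47.5%) to a comfortable win (62.5%) purely because the two states have different demographic mixes.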

What About the Narrative?

This stable underlying preference goes against how we have long thought about the primary process. We have believed that there is a narrative there, with earlier results having a causal relationship to later results: a candidate's early wins or losses lead to -- result in -- later wins and losses, and candidates rise and fall because the race is changing.

But I do not think this is true. I think that the data suggests otherwise. Instead, I think that this is a measurement problem. We have some limited tools to access the underlying trait, so we have adopted a system to address those weaknesses.

How the Primary Process Addresses the Challenges

The third challenge -- the heterogeneity in the distribution of voters -- is the easiest to address: we take multiple measurements. We have primaries (or caucuses) in every state. This gives us multiple readings.

Our primary process also addresses the first two challenges. We begin with a small handful of small states. We give candidates enough time to inform voters in these states, without requiring them to raise massive sums of money. We give these voters time to learn about all of these candidates, and a clear deadline for when they each need to decide. The first two challenges are easier to address with smaller-population states and a lot of time.

This year, we learned in the early states that Martin O'Malley just was not going to do well. That is, with just a couple of measurements, we saw enough to know that he didn't have it. We saw the same with quite a few GOP candidates, and learned where our questions really lay.

Think of this as being an adaptive test -- the kind of test that adjusts which questions you get next based on your performance with earlier questions, narrowing in on the limits of your knowledge, skills and abilities.

Voters in later contests have fewer candidates to choose from. The candidates have fewer rivals with whom to compete for media attention, for money and for voters' attention. Challenges #1 & 2 are not as great with a smaller field.

But wait...!

But does the order of the primaries really not matter?

Let me ask you this: In hindsight, do you think that any of the candidates who have not made it to Super Tuesday could have won with a different ordering of the primaries?

Well, putting a candidate's home state earlier would result in a better result, but would that change anything? Wouldn't any good showing just be chalked up to home-court advantage? How well would a candidate have to do in his/her home state to overcome that? Right now, people are saying that if Marco Rubio and Ted Cruz do not win their home states, they are just done. The bare minimum for them has to be a win there. Their home states raise the stakes for them rather than presenting easy victories. Victories there will not convince anyone of anything.

What about a demographically different state? Well, if O'Malley couldn't get 15% in either Iowa or New Hampshire, does anyone really think he could actually win elsewhere? 

In hindsight, can anyone name a single candidate (in any year) who could really have won their party's nomination, but for the ordering of the primaries? 

Identifying and Answering the Real Question

The early primaries (and caucuses) narrow the question, winnowing the two fields. We learn whether the race is close, and the process allows voters to focus on the truly viable candidates. 

Do the Democrats want Bernie Sanders or Hillary Clinton? Do the Republicans want Donald Trump, Marco Rubio or Ted Cruz? With just four measurements, we've already been able to focus much more tightly than before. 

Super Tuesday will be about those questions. And if we do not get definitive answers, we will have more measurements (with more tightly focused questions) to learn more. When we eventually do get definitive answers, we play out the string, let the eventual nominee rack up wins, and build up a delegate majority.

But, but, but... 

No, many voters in later states will not get to vote for their absolutely favorite candidate while thinking that s/he has a chance to win the nomination. On the other hand, every one of them will get a chance to vote for the eventual winner, perhaps to vote for the last rival to that winner, or at least to signal their disapproval of the eventual winner. If the race stays close, they will be able to decide if they want to select between the last viable candidates, or to cast a more symbolic vote -- perhaps through write-in -- for their old favorite. 

You see, the system is not designed to give voters a chance to vote for whomever they want -- though they can write in anyone they want. Rather, it is designed to select one nominee who best reflects the preferences of the party. As an exercise in measurement, it actually works pretty well.

Next time, I will explore how these ideas might be applied in educational measurement.

Coming in 2016

The AleDev blog, More Thoughtful, will launch later in 2016.