Radical Empathy

July 27, 2020

Radical Empathy is a term that we have been using to describe the most demanding and difficult step of our rigorous Item Alignment Examination procedure. Others use it in different ways in different contexts, but it is an important term for us because it is about a very important tool in the CDP (Content Developer Professional) toolbox.

Radical Empathy is an open-minded iterative process of trying to think through an item (i.e., all the way from the beginning of the stimulus through completing the work product) from as many relevant perspectives as you can. It stands in contrast to consciously working through an item as yourself, or as you remember yourself to have been at an appropriate age.

Empathy is, of course, at the center of Radical Empathy. Reading through an item with empathy is to try to read it through the lens, in the shoes and wearing the hat of another person -- possibly a person quite different from oneself. Obviously, the more different that person is from you, the more challenging this act of empathy is.

What makes Radical Empathy so radical is its iterative and open-minded stance. The point is not merely to capture another perspective. Rather, it is to try your best to imagine the range of perspectives that one might find across the range of typical test takers. You need to take each of those perspectives and work through the item, carefully and consciously, wearing that lens, shoes and hat.

Thus, in addition to thinking from the perspective of a young person -- usually from a whole different generation -- you must consider different educational backgrounds, different experiences of the world, different knowledge and skills, and different amounts of patience and focus, as well as the more traditional demographic issues of gender, race/ethnicity, English language status, urbanicity, etc.

Radical empathy is about capturing as many of these perspectives as possible and carefully thinking through how these differences in perspective might influence the cognitive path that different test takers take as they work through items. However, any individual’s ability to imagine others’ perspectives is necessarily limited by their experiences with other people. Thus, efforts at Radical Empathy — such as in Step IV of Item Alignment Examination — call on content development professionals to pay attention to opportunities to learn more about others’ perspectives, to be open minded about the existence of perspectives they had not previously considered and to be humble about their confidence in their understanding of the perspectives they had considered.

We think that this is quite difficult. We think that every part of this is difficult — from staying in just a single other person’s perspective all the way through an item to maintaining that humble stance. However, because valid items elicit evidence of the targeted cognition for the range of typical test takers, we believe that this is necessary.

The Inefficiency of Education Spending

June 16, 2016

One of the great criticisms (i.e., oft voiced) of our education system is that spending (per student) has shot up over the past 30-40 years, but test scores have barely budged. While the degree of the spending increase is usually overstated, and test scores have grown more than critics give credit for*, the criticism actually is true.

(*Obviously, there is no clear standard for how much scores should increase, but the real issue there is the cherry picking of the one test that has shown the least increase. This exaggerates the lack of increase in test scores, just as not adjusting spending for inflation exaggerates spending increases.)

However, this criticism entirely misses the mark. In fact, the moral mission of education requires our system to get less efficient over time.

Though there have been occasional technological breakthroughs that have greatly increased the efficiency of education (e.g., books), education has always been a very labor-intensive task. While technology can lower the cost of information dissemination, that has long been the easiest element of schooling. So much of the work of schools entails diagnosing errors and holes in students' thinking through examination of their work and what they say, leading to individualized explanations and scaffolding to support their learning. Of course, monitoring and maintaining student engagement and motivation is similarly individualized. And this says nothing of the basic daily childcare role of our schools.

There's no reason to expect efficiency gains in these areas, despite our ever-improving technology. But that does not explain increasing costs. Increasing teachers' wages amid a growing economy and a rising standard of living is, of course, a portion of it. But there is something much deeper pushing against increasing efficiency in schooling.

Our schools -- everyone's schools -- historically have not attended to all students equally. We have long tolerated or even encouraged some kinds of students to drop out. Heck, back in the day we did nothing to encourage them to enroll in the first place.

Who have we focused on? Children best prepared to learn. The smartest students. Students from the most stable and supportive families and communities. Students most readily able to learn, who need the least support from their schools. The cheapest students to educate.

But for most educators, the moral mission of our schools is to educate all students. All.

That means that we need to put greater efforts into reaching students who need more support from their schools, students who are more expensive to educate. These can include students with physical, emotional, mental and/or learning disabilities. These can include students whose parents lack even high school degrees. Students from families less able to support their children's educational pursuits. Students from families whose social and economic conditions are more likely to impede than to support school work and learning.

In the last two decades -- under Presidents George W. Bush and Barack Obama -- there has been incredible federal pressure on schools to pay attention to lower-performing students. Schools have been judged by their success at bringing up those students, rather than being able to claim success merely by focusing on students already doing well in school.

The moral mission of our public schools requires them to target the more difficult (and expensive) to educate students. Success in our public schools comes when we make greater efforts to reach these students, to close achievement gaps, and bring up every student to the level they need to be successful in their further education and/or work, after high school.

Of course we want to get better at reaching all students. Of course we would like to find ways to reach all students that are less resource intensive -- in large part because as we free up resources from some students, we can do a better job of reaching other students. But we have so far to go in the mission of reaching all students that any efficiency gains will be -- and morally ought to be -- used to do still better with students whose performance we judge lacking.

Unlike so much of the world -- unlike the world of commerce and profit motives -- the moral mission of our public schools requires educators to seek out what others might call the worst customers, or the least profitable customers. Charter and other private schools can target their marketing and counsel out students they find too challenging to educate, but public school success is often defined as doing well with precisely those students.

Complaints about long term declines in school efficiency are not only misguided, but actually immoral.

Perhaps the Biggest Problem: Misunderstanding Bias and Error

April 26, 2016

"Bias" has a particular meaning in the field of measurement. Fortunately, this meaning is not that far off from our colloquial/everyday use of the term. In measurement, it means systemic error in a particular direction. This meaning highlights the fact that there is another class of error, the kind that is not systemic in a particular direction: "noise."

Unfortunately, too many people -- including too many psychometricians and other professionals in the field of measurement -- do not really recognize bias, and therefore our use of measurement suffers for it.

***************************************

Noise is endemic in any measurement. Measurements are always a little off -- maybe a little high, maybe a little low. I need 2 tablespoons of sugar, and maybe I grabbed 26 grams, rather than 25. Or maybe it was 24. Or 24.2. Or even 22.8.

The more careful we are, and the more precise our instruments, the better we can do at reducing noise. But we cannot eliminate it. Our measurements will always have some random error component.

In cooking, if we are careful and actually use the right tools, the noise is too small to matter.

In other applications, the noise can matter more. In educational measurement, where we make important decisions for and about children, the noise can be very important. We know this, and we have statistical tools to help us recognize it, to help us quantify it, and to help us think about how to reduce it.

The primary way to deal with noise is to make longer tests. Seriously. And this works. Because noise is -- by definition -- random, in the long run it will cancel out. Tests with more items (i.e., questions) actually lead to less noise in the final score. In this case, more is better. In this case, adding more noisy measurements leads to less overall noise, because random error can cancel out.
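To make the point concrete, here is a minimal simulation sketch. It is not anyone's real scoring model: it simply assumes that each item gives an independent reading of the same true score plus Gaussian noise, and that the final score is the average of those readings. The function names and the numbers (a true score of 0.7, item-level noise of 0.3) are invented for illustration.

```python
import random
import statistics


def observed_score(true_score, n_items, noise_sd, rng):
    """Average of n_items noisy readings of the same true score (assumed model)."""
    readings = [true_score + rng.gauss(0.0, noise_sd) for _ in range(n_items)]
    return sum(readings) / n_items


def score_spread(n_items, trials=2000, true_score=0.7, noise_sd=0.3, seed=0):
    """Standard deviation of the final score across many simulated administrations."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    scores = [observed_score(true_score, n_items, noise_sd, rng)
              for _ in range(trials)]
    return statistics.stdev(scores)


# Hypothetical test lengths: a 10-item test and a 40-item test.
short_test = score_spread(n_items=10)
long_test = score_spread(n_items=40)
print(f"spread of scores with 10 items: {short_test:.3f}")
print(f"spread of scores with 40 items: {long_test:.3f}")
```

Under these assumptions, quadrupling the number of items should cut the spread of the final score roughly in half, because independent random errors partly cancel when averaged.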

This is not how bias works.

***************************************

At the annual NCME (National Council on Measurement in Education) meeting in Washington, DC this month, I had a whole bunch of interesting conversations with other people. These usually happened immediately after a session ended, as I spoke with someone I'd never known before about something we heard from one of the presenters.

Prof. Mark Reckase gave a presentation that focused generally on the differences between educational measurement and psychological measurement. In this speech, he mentioned the Hippocratic Oath, the idea that doctors pledge to, "First, do no harm." It would be wonderful if we had that professional ethical standard in educational measurement. After the session, I spoke with another attendee about this. I was saying that if we were to live by that standard, sometimes we just wouldn't test, we wouldn't use a measurement, we might not sell/license a test to some customers. But he didn't understand.

I tried to give an example, speaking of the problem of gender and racial/ethnic bias in job interviews. Unfortunately, our candidate screening processes tend to perpetuate the makeup of our companies. People are more likely to see their own positive traits in other people who look like them and who have similar backgrounds. It just is harder for people to see positive traits in people who look different and come from different backgrounds than in those who are already similar to them. Thus, an argument could be made that in-person interviews might do more damage than help, and if we lived by the "first, do no harm" standard, perhaps we should just skip them entirely -- even though they are a well established practice. That is, the fact that we have always done them might run right up against the "first, do no harm" standard.

This other gentleman insisted -- over and over again -- that the answer was to do more interviews. That if there was error in the interview process that doing more interviews would compensate for it.

He was confusing random error/noise with bias. He had in his head that bias is just a form of error, and the answer is to let the error cancel out.

But bias does not work this way.

***************************************

Imagine that you have a measuring cup that is off. Imagine that it is just too small, by 10%.

Each time you use it, there will be a random component to the error. You won't get exactly 0.9 cups every time. You will get a little more or a little less than 0.9 cups. The random error/noise will be in addition to the bias. Therefore, carefully measuring out 32 "cups" to get 2 gallons will lead to a really good chance that the noise has cancelled out (for the most part), but you'll still be around 10% short.
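The measuring cup can be sketched the same way. The code below is only an illustration under assumed numbers: a cup that really holds 0.9 cups, plus a small random error on each pour (a standard deviation of 0.03 cups, a figure I made up). The function name is hypothetical.

```python
import random


def pour_total(n_pours, cup_size=0.9, noise_sd=0.03, seed=1):
    """Total volume from n_pours of a cup that is biased (too small) and noisy."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return sum(cup_size + rng.gauss(0.0, noise_sd) for _ in range(n_pours))


n_pours = 32                      # 32 "cups" is meant to be 2 gallons
total = pour_total(n_pours)
shortfall = 1 - total / n_pours   # fraction we are short of the intended volume
print(f"intended: {n_pours} cups, actual: {total:.2f} cups")
print(f"shortfall: {shortfall:.1%}")
```

The random part of the error mostly washes out over 32 pours, but the shortfall stays close to 10%. More pours make the estimate of the bias more precise; they do nothing to remove it.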

If a bunch of items are individually biased a little bit against girls, then using 32 of them won't fix that problem. It will produce a score that is just as biased against girls.

The answer that the measurement industry appears to use is to add a bunch of items that are biased against boys to the item pool. The thinking seems to be that these biases will cancel out. They want to turn bias into noise, and think that they can make it cancel out.

And they do the same for other forms of bias, too. They think that they can just make the bias cancel out.

Unfortunately, this doesn't work for individual test takers. Even if the strategy were sound, it is not applied for individual test takers. A balanced item pool is one thing, but test takers don't take item pools. Test takers take individual forms, and I do not know anyone who examines individual test forms (or adaptively generated forms) for gender bias, or urbanicity bias, or racial/ethnic bias, or SES bias, or any other bias.

Validity problems cannot be turned into noise. Construct underrepresentation -- a huge problem in educational measurement -- cannot be turned into noise. Dumbing down of content for the sake of our testing technology cannot be turned into noise. Lowering the cognitive complexity of items to accommodate time limits on our tests cannot be turned into noise.

These are all biases in our tests. But too often we forget that not all error is noise.

The Right to an Education

March 21, 2016

What does it mean to have a right to an education? I've been thinking about this, in light of talk this month about the right to free speech.

*********************

The First Amendment to the Constitution is the source of our right to free speech. But clearly this right is widely misunderstood. Our speech is generally free from government interference, but even that freedom is limited. Though defamation, slander and libel are difficult to prove in court, the government does limit these. There are many limits on commercial speech, as well. On the other hand, our political, artistic, literary and scientific speech are almost entirely free from government interference.

However, the government does not protect our speech from others. We can be fired for what we say. We can be dumped. We can lose friends. We can lose customers. Our freedom of speech is really just about freedom from government interference or regulation of speech.

On the other hand, there are other freedoms that the government does protect. The government will take action against those who discriminate, for example. We cannot be denied a room at a hotel or a meal at a restaurant for the color of our skin or our religion. The federal government will protect that right. We must have a right to the same educational opportunities, regardless of our sex. The federal government will protect that right (provided there is any link to federal funding).

Many of our rights are to be free from the government. Some of our rights are ensured by the federal government against others. But what about education?

*********************

Unfortunately, we do not have a right to an education -- at least not on the federal or national level. The word education does not appear in the Constitution, and any federal role in education is questionable. Some argue that it is tied in with interstate commerce, and therefore a federal issue. But generally, our acceptance of a federal role in education has nothing to do with the Constitution. It has developed and grown over time, and we generally accept it -- even though some argue against it.

That leaves our right to an education as a state matter. In fact, the right to an education varies from state to state. The feds have addressed special education and both racial and sex discrimination in education, but other than that it is up to individual states.

How much education do we have a right to? What quality of education? How much geographical equality? Equality of opportunity or equality of effort? How do we define effort or opportunity for these purposes? All of these are state matters. All of them.

Our right to an education comes from state constitutions, state laws and state regulations, as implemented by state and local government and as interpreted by state courts. Not federal.

Moreover, there is very, very little regulation of traditional private schools. The federal government has ruled that we cannot be compelled to attend a public school if we would rather attend (and pay for) a private school. If private schools take any federal funding, some regulations apply (e.g., the sex discrimination stuff). Otherwise, private schools are regulated like any other business (either for-profit or non-profit). There is a bit more regulation of charter schools, but this might just be a product of how much government funding they take.

So, the government is not going to step in and assure your right to an education. No one is going to assure you that your private school is actually providing an education to its students. The federal government tried to force states to step in with public schools to assure that they are providing a decent education, but is now pulling back. (Pulling back from an effort that did a piss poor job of defining or measuring an education, anyway.)

While most (all?) states do promise a right to a free public education, the quality of those schools is far from assured. In fact, it appears that while we may have a right to an education, we do not have a right to a decent education.

What is the Point of Teacher Evaluation? Seriously, What is the Purpose?

March 8, 2016

Recently, Rick Hess has written about the pointlessness of new teacher evaluation systems, claiming – in his headline – that they "Don't Make a Difference." But I think Rick might be missing the point. (He is basing this on the research of Matt Kraft and Allison Gilmour, which I have downloaded but not yet read. They might be missing the point, too, but I can't be sure, yet.)

The point. What is the point or the purpose of teacher evaluation? Well, I can think of a few possibilities, but the bottom line of each and every one of them would have to be improving outcomes for children. How might teacher evaluation systems do that?

1) Identify struggling teachers for removal.
2) Identify struggling teachers for targeted intervention to improve their effectiveness.
3) Intimidate struggling teachers to remove themselves.
4) Provide a structure or framework for struggling teachers and those who support them to think about teaching, so that they can better improve their effectiveness.
5) Provide a structure or framework for all teachers and those who support them to think about teaching, so that they can better improve their effectiveness.

Those are the basic mechanisms by which a formal teacher evaluation system may improve aggregate teacher effectiveness, and thereby improve outcomes for children.

But I think there are more than that, because mechanisms #1 and #2 are ambiguous in who must know the identity of the struggling teachers. If it is just the teachers themselves, then mechanisms #1 and #2 are the equivalent of #3 and #4, but there remain multiple possibilities. Perhaps struggling teachers should be identified to their local supervisors? Perhaps they should be identified to their local peers? Perhaps they should be identified to their district offices? Or to the public, or to the state or feds?

Each of those implies a somewhat different mechanism. Peers might support a struggling teacher in other ways than a supervisor, and systems by which peers decide on the removal of ineffective teachers – usually called Peer Assistance and Review (PAR) – do exist in some places. Certainly, targeted assistance and termination procedures are quite different if they are based on local supervisors rather than the district office – and likely different again if based on state officials or feds.

So, it's pretty complicated.

But here's where I think Rick (and likely Matt and Allison) is making his big mistake: many of these mechanisms do not require accurate reporting of teacher effectiveness. Many of these mechanisms are not undermined by fudging the public or official recording to the advantage of the struggling teacher.

So long as a teacher, his/her supervisor and/or his/her peers know that this teacher is struggling, the mechanisms based on improving his/her effectiveness can still work.

So long as a teacher knows that s/he is struggling, s/he can still leave. So long as a supervisor knows that a teacher is struggling, s/he can still pressure the teacher to leave. So long as a supervisor knows that an untenured teacher is struggling, s/he can fire that teacher without citing ineffectiveness as the reason.

Let me say this again: Teacher evaluation programs do not have to accurately record which teachers are struggling/ineffective to improve aggregate teacher performance and/or outcomes for children. They do not.

But what does require accurate recording of teacher ineffectiveness?

State or federal intervention in handling struggling teachers.
Humiliation of struggling teachers by public shaming.

Now, Rick doesn't believe that the feds can effectively intervene in this kind of delicate problem, and his logic there applies to most states, as well. So, where does that leave us? Either one of the goals of teacher evaluation systems is the humiliation of teachers (individually or collectively), or Rick is simply wrong that crazy high reporting of effective teachers (95%+) is a sign that the systems are not working.

I think that Rick is simply wrong.

**********************************************

This actually takes us to a common problem with our education policy. For a variety of reasons – some better than others – we want an unprecedented amount of transparency in our efforts to improve schools. I don't know of any other field that calls for the public to know how individual workers are evaluated -- either individually or in the aggregate. Nor is franchise or branch office performance made public.

Sure, we all know how a sports team did each game, but no one expects the internal ratings of each member's performance to be reported to the public. Politicians do not release the performance evaluations of their staffs, neither individually nor in the aggregate. Researchers do not publicly release their evaluations of their students or their teams, neither individually nor in the aggregate. Think tanks do not release evaluations of their members, contributors or staffs.

So, why is it that we need to know how many teachers were deemed effective? It is not because we cannot improve outcomes for students without releasing these numbers publicly. I am not happy with the only reason I can think of.

Our Electoral Primary Process as a Measurement Problem

February 29, 2016

There is a lot to complain about with regards to our electoral primary process. My wife's favorite complaint -- other than the weird undemocratic nature of caucuses and the unfairness of the same two states going first every cycle -- is that our votes "never matter." We've never lived in an early primary state, and rarely even voted as early as Super Tuesday.

Does this mean that our votes have not mattered? Does it mean that our votes have mattered less than those of Super Tuesday primary voters? Well, if we think of the primary process as a measurement problem, I think that the answer is, "No." In this post, let me lay out what that means. Next time, I will explore this view for lessons about measurement in education.

The Construct Being Measured

The goal of the primary process is to select a candidate for the party's nomination for the presidency. The goal is not to find out who is the third most popular candidate. The goal is not to figure out who has the best chance of winning in the general election -- though perhaps it should be. The goal is to figure out who the party's supporters (i.e., voters) want to be the party's nominee.

Measurement Challenges

Every (interesting) measurement problem poses its own set of challenges. I see three major challenges with this one.

1) Voters and potential voters may lack information about the candidates.

2) Candidates may lack the resources they need to inform voters.

3) Voters and potential voters are not distributed homogeneously around the country.

The Key Assumption

There remains a key question that we must answer, because what we assume to be the answer will inform our solution and how satisfied we are with it.

Do the primaries reveal a relatively stable preference of the group, or do the primaries themselves shape and influence the development of a shifting preference?

Nate Silver's original work in 2008 on the Democratic primaries was based on a single insight -- one that I and others noticed as well. While Clinton and Obama's overall share of the votes varied considerably from state to state, their support within demographic groups was remarkably stable from state to state. Thus, given just a small set of results, he was able to extrapolate future results quite accurately, just based on the demographic profiles of the states.

This strongly suggests that the underlying trait (i.e., the preference of the group) is stable, and the fluctuations are essentially due to differences in the composition of each sample (i.e., the demographics of each state).
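Here is a toy sketch of that idea. To be clear, this is not Nate Silver's actual model, and every number in it is invented: it simply assumes that a candidate's support within each demographic group is constant everywhere, so that a state's result is just those fixed levels of support weighted by the state's demographic mix. The group and state names are hypothetical.

```python
def predict_share(group_support, state_mix):
    """Predicted vote share: fixed within-group support weighted by the state's mix."""
    return sum(group_support[group] * share for group, share in state_mix.items())


# Hypothetical within-group support for one candidate, assumed stable across states.
group_support = {"group_a": 0.70, "group_b": 0.40}

# Hypothetical electorates: each state's shares of the two groups sum to 1.
state_x = {"group_a": 0.2, "group_b": 0.8}
state_y = {"group_a": 0.6, "group_b": 0.4}

share_x = predict_share(group_support, state_x)
share_y = predict_share(group_support, state_y)
print(f"predicted share in state X: {share_x:.2f}")
print(f"predicted share in state Y: {share_y:.2f}")
```

With these made-up numbers the candidate loses state X and wins state Y, even though nobody's preferences differ between the two states. Only the composition of the sample has changed.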

What About the Narrative?

This stable underlying preference goes against how we have long thought about the primary process. We have believed that there is a narrative there, with earlier results having a causal relationship to later results. A candidate's early wins or losses lead to -- result in -- later wins and losses. We have believed that candidates rise and fall because the race is changing.

But I do not think this is true. I think that the data suggest otherwise. Instead, I think that this is a measurement problem. We have some limited tools to access the underlying trait, so we have adopted a system to address those weaknesses.

How the Primary Process Addresses the Challenges

The third challenge -- the heterogeneity in the distribution of voters -- is the easiest to address: we take multiple measurements. We have primaries (or caucuses) in every state. This gives us multiple readings.

Our primary process also addresses the first two challenges. We begin with a small handful of small states. We give candidates enough time to inform voters in these states, without requiring them to raise massive sums of money. We give these voters time to learn about all of these candidates, and a clear deadline for when they each need to decide. The first two challenges are easy to address with smaller-population states and a lot of time.

This year, we learned in the early states that Martin O'Malley just was not going to do well. That is, with just a couple of measurements, we saw enough to know that he didn't have it. We saw the same with quite a few GOP candidates, and learned where our questions really lay.

Think of this as being an adaptive test -- the kind of test that adjusts which questions you get next based on your performance on earlier questions, narrowing in on the limits of your knowledge, skills and abilities.

Voters in later contests have fewer candidates to choose from. The candidates have fewer rivals with whom to compete for media attention, for money and for voters' attention. Challenges #1 & 2 are not as great with a smaller field.

But wait...!

But does the order of the primaries really not matter?

Let me ask you this: In hindsight, do you think that any of the candidates who have not made it to Super Tuesday could have won with a different ordering of the primaries?

Well, putting a candidate's home state earlier would result in a better result, but would that change anything? Wouldn't any good showing just be chalked up to home-court advantage? How well would a candidate have to do in his/her home state to overcome that? Right now, people are saying that if Marco Rubio and Ted Cruz do not win their home states, they are just done. The bare minimum for them has to be a win there. The stakes are higher for them there, rather than their home states presenting easy victories. Victories there will not convince anyone of anything.

What about a demographically different state? Well, if O'Malley couldn't get 15% in either Iowa or New Hampshire, does anyone really think he could actually win elsewhere?

In hindsight, can anyone name a single candidate (in any year) who could really have won their party's nomination, but for the ordering of the primaries?

Identifying and Answering the Real Question

The early primaries (and caucuses) narrow the question, winnowing the two fields. We learn whether the race is close, and the process allows voters to focus on the truly viable candidates.

Do the Democrats want Bernie Sanders or Hillary Clinton? Do the Republicans want Donald Trump, Marco Rubio or Ted Cruz? With just four measurements, we've already been able to focus much more tightly than before.

Super Tuesday will be about those questions. And if we do not get definitive answers, we will have more measurements (with more tightly focused questions) to learn more. When we eventually do get definitive answers, we play out the string, let the eventual nominee rack up wins, and build up a delegate majority.

But, but, but...

No, many voters in later states will not get to vote for their absolute favorite candidate while thinking that s/he has a chance to win the nomination. On the other hand, every one of them will get a chance to vote for the eventual winner, perhaps to vote for the last rival to that winner, or at least to signal their disapproval of the eventual winner. If the race stays close, they will be able to decide whether they want to select between the last viable candidates, or to cast a more symbolic vote -- perhaps through write-in -- for their old favorite.

You see, the system is not designed to give voters a chance to vote for whomever they want -- though they can write in anyone they want. Rather, it is designed to select one nominee who best reflects the preferences of the party. As an exercise in measurement, it actually works pretty well.

Next time, I will explore how these ideas might be applied in educational measurement.