The Road from Validity (Part II): Cementing the Replacement of Validity

Last month, I explained how large scale human scoring drifts away from carefully constructed rubrics and procedures, replacing them with something different. The desire for reliability, and the procedures that favor reliable scorers (i.e., those most likely to agree with other raters), push out those who use the valid procedures and favor those who use the faster procedures.

Of course, we no longer live in an age of large scale human scoring of test taker work product. We have replaced most human scoring with automated scoring. Algorithms and artificial intelligence engines of various sorts do the scoring instead. They are trained on examples of human scoring, but the bulk of the work is done by the machines.

How do we know whether the automated scoring engines are any good? Well, their creators, backers and proponents proclaim again and again that they do what human scorers do, only better. Not just faster and cheaper, but better.

What does better mean? Well, they use the same idea of reliability to judge the automated engines as they do to judge human scorers. They make sure that the automated scoring engines have high agreement rates with the pool of human scorers on some common pool of essays. That sounds fair, right? The same procedures used to judge human scorers, and to decide who is invited back, are used to judge the machines.
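To make that concrete, here is a minimal sketch of how such a check is typically run. The metrics (exact agreement, adjacent agreement, and quadratic weighted kappa) are standard in the scoring literature; the score arrays at the bottom are hypothetical stand-ins for one engine and one human scorer rating the same pool of essays on a 1-6 scale.

```python
import numpy as np

def agreement_stats(human, engine, min_score=1, max_score=6):
    """Compare engine scores to human scores on a common pool of essays."""
    human = np.asarray(human)
    engine = np.asarray(engine)

    exact = np.mean(human == engine)                 # identical scores
    adjacent = np.mean(np.abs(human - engine) <= 1)  # within one point

    # Quadratic weighted kappa: agreement beyond chance, with larger
    # disagreements penalized quadratically.
    n = max_score - min_score + 1
    observed = np.zeros((n, n))
    for h, e in zip(human, engine):
        observed[h - min_score, e - min_score] += 1
    observed /= observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    weights = np.array([[(i - j) ** 2 for j in range(n)]
                        for i in range(n)]) / (n - 1) ** 2
    kappa = 1 - (weights * observed).sum() / (weights * expected).sum()
    return exact, adjacent, kappa

# Hypothetical scores on a shared pool of ten essays (1-6 scale).
human = [4, 3, 5, 2, 4, 6, 3, 4, 5, 2]
engine = [4, 3, 4, 2, 4, 5, 3, 4, 5, 3]
print(agreement_stats(human, engine))
```

Note what the check measures: closeness to the humans, and nothing else.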

This means that the same dynamic that replaced validity with reliability—that replaced the valid scoring procedure with the faster scoring procedure—is used to tune the algorithm.
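Here is what that dynamic looks like in code, a minimal and hypothetical sketch of the selection step: candidate scoring functions compete, and whichever one agrees with the human pool most wins, just as the most agreeable human scorers are the ones invited back.

```python
import numpy as np

def exact_agreement(human, engine_scores):
    """Fraction of essays on which two sets of scores match exactly."""
    return float(np.mean(np.asarray(human) == np.asarray(engine_scores)))

def select_engine(candidates, essays, human_scores):
    """Keep whichever candidate scoring function best matches the pool
    of human scores. Nothing in this loop ever consults the rubric."""
    def fit(engine):
        return exact_agreement(human_scores, [engine(essay) for essay in essays])
    return max(candidates, key=fit)
```

The criterion is agreement with the humans, full stop. A candidate that followed the valid procedure but disagreed with the fast-procedure scorers would lose this contest every time.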

No, automated scoring is not better at using the valid procedure; it is not faster, cheaper, or more diligent at that. "Better" means that automated scoring is better at using the faster procedure.

Of course, once the faster procedure is coded into the algorithm, it is harder to question. This is yet another form of algorithmic bias. Algorithms trained on biased data perpetuate those biases rather than highlight them. In this case, the algorithm simply repeats the biases that make up the deviations from the valid procedure. Whatever was harder, slower, or more nuanced in the valid procedure is left out.
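A toy demonstration of that perpetuation, under entirely made-up assumptions: suppose the human scores in the training data mostly track essay length (a fast proxy) rather than the rubric trait they were supposed to measure. A model fit to those scores learns to reward length, and nothing in the training pipeline flags the substitution.

```python
import numpy as np

rng = np.random.default_rng(1)
n_essays = 500

# Two hypothetical features per essay: the trait the valid procedure
# measures, and sheer length, the fast proxy.
rubric_trait = rng.normal(size=n_essays)
length = rng.normal(size=n_essays)

# Biased training data: the human scorers leaned on the fast procedure,
# so their scores follow length far more than the rubric trait.
human_scores = 0.9 * length + 0.1 * rubric_trait + rng.normal(0, 0.1, n_essays)

# Fit the engine to match the humans (ordinary least squares).
X = np.column_stack([rubric_trait, length])
coef, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

print(f"weight on rubric trait: {coef[0]:.2f}")  # ~0.1: nearly ignored
print(f"weight on length:       {coef[1]:.2f}")  # ~0.9: the shortcut, cemented
```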

No, machines do not make mistakes. They do what they are told to do. The human scorers who depart from the valid (and slower) scoring procedures are making mistakes. But the machines are simply doing what they are told to do: trying to match the results of the human scorers.

And their developers and proponents brag about that, as though doing what they are told to do is necessarily doing the right thing. They fail to audit their training data (i.e., the results of human scoring) for validity, trusting in their beloved reliability. So, their algorithms cement the biases and departures from validity and hide them behind claims of reliability, as though reliability were the same thing as validity.
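What would such an audit even look like? One crude, hypothetical version: before training, check how much of the human score variance can be explained by a construct-irrelevant surface feature alone (word count is the classic culprit). A high value does not prove the scores invalid, but it is exactly the kind of red flag that reliability checks never raise.

```python
import numpy as np

def surface_feature_audit(essays, human_scores):
    """Validity smoke test: R^2 of a regression predicting human scores
    from word count alone. A value near 1.0 suggests length, not the
    rubric, is doing the scoring."""
    lengths = np.array([len(text.split()) for text in essays], dtype=float)
    scores = np.asarray(human_scores, dtype=float)

    # Fit score ~ a * length + b by least squares, then compute R^2.
    X = np.column_stack([lengths, np.ones_like(lengths)])
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    predicted = X @ coef

    residual = np.sum((scores - predicted) ** 2)
    total = np.sum((scores - scores.mean()) ** 2)
    return 1 - residual / total
```

Checks like this are cheap. They are just not the checks that get run.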

Not intentionally. Perhaps not quite inevitably. But quite consistently. Quite reliably.