Imagine that you have some complex multidimensional trait that you want to test for.
A. You get a bunch of experts and ask them to rate the recorded performances of test takers… only they don’t agree on scores. They rate the performances inconsistently because they do not agree on how to weight the different contributing factors and dimensions. They are all experts, so they are not wrong, but you judge the inconsistency a problem.
B. So, you train them. You provide them rubrics and exemplars. You tell them how they should be scoring. You try to standardize their judgments.
C. Then, you want to remove the outlier scorers and keep the ones who are consistent with each other. You want comparable scores, so you can explain what the scores actually mean, what they refer to. You don’t want that mess of inconsistency, which you believe does not actually help anyone.
To ensure that test takers are rated fairly and outlier scorers are caught and removed, you have every performance rated by two different scorers, selected at random. If they agree, that’s good. If they disagree, that’s bad. Over the course of a day or a week, each scorer rates a whole bunch of performances, so you can check their overall rate of agreement with their randomly selected scoring peers. The ones who tend to agree more are the ones you keep, and the ones who seem to march to their own beat more often are not invited back.
D. You are actually running a business, or at least working within some sort of budget. You want the scoring done as quickly as practicable. And the scorers? The job is rather repetitive and boring, so they don’t focus as well on each individual performance when they are rating so many of them. These are two different pressures for speed, one top-down and one bottom-up. Some scorers go faster, taking little shortcuts and using their own heuristics in place of the full formal procedures laid out in step B.
Is that real scoring expertise at work? Is that developing skill? Or is it the replacement of the official procedures with something else that is a bit faster?
E. Let’s look more closely at what happens to the agreement rates. We’ll use easy round numbers to keep the analysis simple. Imagine that half of the scorers take that faster route, and half go by the officially sanctioned formal route. And imagine that the faster route is twice as fast. That means the fast group is going to score twice as many performances as the slow group, which in turn means that everyone’s randomly selected scoring partner is twice as likely to be a fast scorer as a formal scorer.
In our little simplified thought experiment, these two groups are using two different methods of scoring, and they will sometimes differ in the scores they report. Whether I am a fast scorer or a formal scorer, 2/3 of the time I will be randomly paired with a fast scorer. So I will have a better agreement rate if I use the fast scoring method than if I use the formal scoring method. The fast scorers will have higher agreement rates than the formal scorers.
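The arithmetic above can be checked with a quick sketch. The group sizes and the simplifying assumption that scorers agree only when they use the same method are illustrative choices for this thought experiment, not features of any real scoring program:

```python
# Half of 100 scorers use the fast method, half the formal method;
# fast scorers rate twice as many performances, so the pool of rated
# performances (and hence of random partners) skews toward fast scorers.
n_fast, n_formal = 50, 50
speed_fast, speed_formal = 2, 1  # relative scoring rates

fast_slots = n_fast * speed_fast        # performances rated by fast scorers
formal_slots = n_formal * speed_formal  # performances rated by formal scorers

# Probability that your randomly selected partner used the fast method:
p_fast_partner = fast_slots / (fast_slots + formal_slots)
print(p_fast_partner)  # 2/3

# Simplifying assumption: same-method pairs agree, cross-method pairs
# do not. Then each group's expected agreement rate is just the chance
# of drawing a same-method partner.
agreement_fast = p_fast_partner        # 2/3 for fast scorers
agreement_formal = 1 - p_fast_partner  # 1/3 for formal scorers
```

The key point is that the 2/3 figure comes from the volume of performances rated, not from the number of scorers, which is still split 50/50.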
The fast scorers will be retained and the formal scorers will be replaced, and some of the replacements will opt for the faster method. This will further lower the formal scorers’ agreement rates and further raise the fast scorers’ agreement rates.
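A few rounds of this selection loop can be sketched as well. The churn and adoption fractions below are invented for illustration; the only assumption carried over from the thought experiment is that same-method pairs agree and cross-method pairs do not:

```python
# Each round: formal scorers, having lower agreement rates, are
# disproportionately replaced, and most replacements adopt the fast
# method. Watch the fast scorers' share (and agreement rate) climb.
fast_share = 0.5   # initial fraction of scorers using the fast method
SPEED = 2          # fast scorers rate twice as many performances
CHURN = 0.2        # fraction of formal scorers replaced each round (assumed)
ADOPT = 0.8        # fraction of replacements choosing the fast method (assumed)

for rnd in range(6):
    # Share of rated performances (i.e., of random partners) that are fast:
    p_fast = fast_share * SPEED / (fast_share * SPEED + (1 - fast_share))
    print(f"round {rnd}: fast share {fast_share:.2f}, "
          f"fast agreement {p_fast:.2f}, formal agreement {1 - p_fast:.2f}")
    # Replaced formal scorers who switch to the fast method:
    fast_share += ADOPT * CHURN * (1 - fast_share)
```

Under any positive churn and adoption rates, the fast share only ratchets upward: each replacement round shrinks the formal group, which lowers formal scorers' agreement rates further and makes them still more likely to be replaced next round.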
*********************
What has happened to your construct? What are the test takers’ performances being judged against? Would you even notice the shift?
Would you notice that using reliability to evaluate scorers drove a shift from the formal, documented scoring procedures that were designed to best evaluate the construct to some (perhaps obvious) shortcuts that do not consider all of the dimensions and subtlety of the construct?