Vertical Scales and Unexamined Assumptions about Unidimensionality

Just this week, Chalkbeat’s Matt Barnum asked about the meaning of NAEP’s apparent use of a single scale to report all of its test results. His question gets at vertical scaling, and this example makes the problems with vertical scaling easy to see.

What is a Vertical Scale?

While the same set of marks is typically reused across grade levels (e.g., either the A-F system or the 100-point scale), this is not always done with reporting on standardized tests. Though people understand that a student who just earned a B+ in 10th grade knows much more than a student who just earned an A- in 5th grade, some people want to highlight that there is a longer continuum across the grades. They even want to compare the performance of students (or collections of students) across grades. That is where vertical scaling comes in.

With vertical scaling, we do not have to reset our understanding of the reporting scale for each grade. Instead, the scale just keeps going up. So, the average 2nd grader might score in the 140’s, an average 3rd grader in the 160’s, an average 4th grader in the 190’s, and so on and so on all the way up to the average 11th grader in the 620’s. It’s a VERY long scale, with lots of overlap between grades.

There are generally defensible techniques for doing this — though they rely on problematic assumptions. Vertical scales are very important to support various policy goals and evaluation approaches. More simply, though, they support more kinds of comparisons — even comparisons of how much a single child learned one year vs. another year or how much two children in two different grades learned.

The key to vertical scaling is the use of anchor items. Anchor items allow the linking of two tests — across multiple forms of a test, across different years, across different grades. By reusing a handful of items on both tests, test developers can use those items as a kind of splice that enables comparisons across tests. That is, by quantifying test takers’ performance on those anchor items on each test, they can use the anchors as a common baseline to link performance on all the items on each test to each other — regardless of which test the items are on.

In the context of vertical linking, they take some of the harder items on the lower-grade test and some of the easier items on the higher-grade test and make sure they all appear on both tests. (They do not have to be the easier/harder items, but I think the logic works better when they are.) These shared anchor items provide the psychometric bridge to create a single reporting scale for both tests. Do that across the gap between each pair of adjacent grades and you can get a single scale for the entire span of K-12 education.
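
To make the splice concrete, here is a minimal sketch of one common linking approach (mean/sigma linking) using anchor item difficulties. All of the variable names and numbers are made up for illustration; this is not a description of how any particular testing program actually builds its scale.

```python
# A minimal sketch of mean/sigma linking with anchor items -- one common way
# (not necessarily any particular program's way) to put two tests on one scale.
# All difficulty values below are invented for illustration.
import numpy as np

# Difficulty estimates for the SAME anchor items, estimated separately on
# the lower-grade test (grade 3) and the higher-grade test (grade 4).
anchor_b_grade3 = np.array([0.8, 1.1, 1.4, 0.9, 1.2])      # hard for grade 3
anchor_b_grade4 = np.array([-0.6, -0.3, 0.1, -0.5, -0.2])  # easy for grade 4

# Mean/sigma linking: the linear transformation that maps the grade 4 scale
# onto the grade 3 scale so that the shared anchors line up.
A = anchor_b_grade3.std() / anchor_b_grade4.std()
B = anchor_b_grade3.mean() - A * anchor_b_grade4.mean()

def to_grade3_scale(theta_grade4):
    """Re-express a grade 4 ability estimate on the grade 3 scale."""
    return A * theta_grade4 + B

# An average grade 4 student (theta = 0 on the grade 4 scale) lands well
# above an average grade 3 student (theta = 0 on the grade 3 scale).
print(to_grade3_scale(0.0))
```

The point of the sketch is only that the shared anchor items supply the statistical glue; everything else about the two tests is different.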

Unfortunately, I don’t buy it.

Unidimensionality’s Basic Falsehood

Unidimensionality is the idea that whatever it is that we are measuring, we really are measuring just one thing. That is, if this is a math test, then we are measuring math. We can basically treat each item as contributing equally to the score because each item measures one unit of math. We can summarize performance with a single score on this 3rd grade math test because 3rd grade math is just this single homogeneous thing.

The problem is that 3rd grade math is not a single homogeneous thing. 3rd grade math is MANY things. Common Core has 5 different domains in 3rd grade math, comprising 25 different Math Content Standards. If you count all the individually broken down subparts of CCSS’s 3rd grade math standards, you get 33. Of course, there are also the eight Standards for Mathematical Practice.

How can we report 3rd grade math as a single score when it has all those different parts? We know the parts are different because the content experts tell us so. We know that different kids have trouble with different parts. We know that they come at different grain sizes — even just between the Standards for Mathematical Practice and the Content Standards.

The Reporting Compromise and Its Unexamined Assumption

There is such utility in reporting performance unidimensionally that we simply have to find a compromise. Now, this is a compromise that we have all long been comfortable with. After all, we accept report cards that give students a single grade for math, a single grade for science, and a single grade for each course they take. We accept that in test reporting as well.

The compromise is to acknowledge that there are different standards, so the reported score is a composite score. 4 parts this domain, 3 parts that domain, 6 parts this other domain. It is like a teacher who says that grades in their class are made up of:

  • 30% homework

  • 30% projects

  • 20% tests

  • 20% class participation

Because standardized test reporting affects so many thousands or millions of students, those composites should be designed very carefully. They should properly weight the different elements of the entire content domain, because different weightings will yield different results. Different weightings will encourage teachers to focus on different parts of the curriculum. Different weightings will favor or disfavor different students, different teachers, different schools and different instructional approaches.
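
To see how much the weightings can matter, here is a small sketch with hypothetical domain subscores and two hypothetical blueprints. The domain names, students, and numbers are all invented for illustration.

```python
# A toy demonstration that the same performances yield different composite
# scores -- and even a different ranking -- under different blueprint weights.
subscores = {  # hypothetical percent-correct by domain for two students
    "Student A": {"Operations": 90, "Fractions": 55, "Measurement": 80, "Geometry": 70},
    "Student B": {"Operations": 60, "Fractions": 85, "Measurement": 70, "Geometry": 85},
}

# Two hypothetical blueprints that weight the same domains differently.
blueprint_1 = {"Operations": 0.40, "Fractions": 0.20, "Measurement": 0.25, "Geometry": 0.15}
blueprint_2 = {"Operations": 0.20, "Fractions": 0.40, "Measurement": 0.15, "Geometry": 0.25}

def composite(scores, weights):
    """Weighted composite of domain subscores."""
    return sum(scores[domain] * weight for domain, weight in weights.items())

for name, scores in subscores.items():
    print(name, round(composite(scores, blueprint_1), 1), round(composite(scores, blueprint_2), 1))
# Under blueprint_1, Student A outscores Student B; under blueprint_2 the
# ranking flips. Same performances, different "single" scores.
```

Same kids, same answers, different blueprint — different winners.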

Thank god, the developers and sponsors of standardized tests know that the weightings matter. They try to be thoughtful about them. However, they may not be thoughtful enough. They may be too driven by convenience and too accommodating of the limitations on the tests (and of how those limitations drive the weightings). But no one takes designing a test blueprint lightly. Nonetheless, there is always something arbitrary about the weightings, as there is no definitively correct answer and there are so many factors that influence blueprint design that have nothing to do with the content domain itself (e.g., item type limitations, scoring budgets, seat time limitations).

Unfortunately, the real unexamined assumption is that the items themselves actually measure what they purport to measure. There is very little work on making sure that items do not individually produce false positive or false negative results. That is, there is little attention to whether students can solve them without using the targeted standard or might fail to solve them for reasons other than lack of proficiency with the targeted standard.

This lack of care with item validity (i.e., whether items elicit evidence of the targeted cognition from the range of typical test takers) undermines the thoughtful work of designing the composite that a test’s blueprint promises. If the items don’t measure what they purport to measure, the elements of the composite are not properly weighted. Some elements might not even be represented at all!

This leads to scores whose meanings are uninterpretable — unless we just accept that the blueprint and the details of the composite’s weights do not really matter. After all, 3rd grade math really is just one thing, right?

Problematically Assuming Unidimensionality for Vertical Scaling

Vertical scaling necessarily assumes unidimensionality. It has to. Even if the composite were crafted incredibly wisely and each item actually were perfectly valid, successive grades would have different composites. Some subdomains are more important in 3rd grade math and others are more important in 4th grade math. Eventually, lower level content is taken for granted so that higher level content can be focused on. For example, while arithmetic is always important, the importance of integer addition on tests fades as more advanced arithmetic is covered. Eventually, the importance of arithmetic itself fades as algebra and other more advanced topics gain focus.

  • If the composite changes, what does it even mean to link scores between them?

  • If we acknowledge that the summative score is made up of different subdomains, how many anchor items do we need to link the subdomains across grades?

  • If a new subdomain appears at some grade, what does it do to the very idea of linking scores across grades?

The only way to resolve these (and other) issues is to hand wave them away and assume unidimensionality.

Back to NAEP’s (facially) Vertical Scale

The National Assessment of Educational Progress — “the nation’s report card”!! — makes no such claim. It does not claim to be a vertical scale. It does not claim that 4th grade scores can be compared to 8th or 12th grade scores. It does not claim that a two-point increase in 8th grade means the same thing as a two-point increase in 4th grade. It does not claim that high enough performance on the 8th grade test would mean more advanced average proficiency than a very low performance on the 12th grade test.

Not at all. It is not a vertical scale. But the three grades are reported in a way that looks like it might be a vertical scale.

But here is how we know it could never be a vertical scale: You cannot anchor items between two levels so far apart. If the items on the 4th and 8th grade tests each actually represent appropriate grade-level standards, we should not expect that any decent number of 8th graders would get the 4th grade items incorrect. Nor should we expect enough 4th graders to get any 8th grade items correct. Certainly not enough to splice the two tests’ scales together.

This is not about how smart the 4th graders are. Rather, it is that they simply have not been exposed to the 8th grade content yet. Any signal (i.e., information about 8th grade math skills) in that data would be overwhelmed by noise (e.g., test-taking savvy). Similarly, 8th graders who get 4th grade items incorrect might be far more likely to do so because they misread the item, rushed or were sloppy than because they lack the content expertise. Again, the noise of construct-irrelevant factors would overwhelm any signal of some 8th graders’ lack of proficiency with 4th grade content.

You simply cannot link tests that are so far apart because you cannot ask these students the same kinds of questions.
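
A toy simulation makes the signal-to-noise problem concrete. Assume, purely for illustration, that 4th graders answer a far off-grade item essentially at chance while an on-grade item depends on their ability; the simple response model and all the numbers below are invented, not drawn from any real NAEP data.

```python
# A toy simulation of why far off-grade anchor items carry almost no usable
# signal about ability. Made-up numbers and a simple logistic response model.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta = rng.normal(0, 1, n)  # hypothetical 4th graders' math ability

def prob_correct(theta, difficulty):
    """Simple logistic (Rasch-style) probability of a correct response."""
    return 1 / (1 + np.exp(-(theta - difficulty)))

# On-grade item: success depends on ability.
on_grade = rng.random(n) < prob_correct(theta, difficulty=0.0)

# Far off-grade item: content not yet taught, so responses are essentially
# guessing (25% chance of success) regardless of ability.
off_grade = rng.random(n) < 0.25

print(np.corrcoef(theta, on_grade)[0, 1])   # substantial correlation
print(np.corrcoef(theta, off_grade)[0, 1])  # approximately zero
```

An anchor item whose responses are uncorrelated with what the test is trying to measure cannot serve as a splice between scales.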

The Point?

Well, I see two important takeaways.

First, I find Matt’s question disturbing because he works for a very good education-specific news site and his beat includes both education policy and education research. Among scholars I respect, he is well thought of. No question, he knows a lot for an education journalist.

And yet, even Matt did not understand this. I’ve no idea how many times he has reported on NAEP scores, and the use of testing has been one of the dominant themes in education policy for decades. If Matt does not understand this, then what does that say about the rest of the media? What does this say about our elected leaders, about parents and about voters?

Second, whenever I challenge psychometricians about their assumptions of unidimensionality, they retort that their methods are robust to some amount of multidimensionality. They report that their statistical methods do not break down when faced with data that is not strictly unidimensional. Of course, I accept that. But that does not mean that the results they report mean at all what they think they do. Validity is about “interpretations of test scores for the proposed uses of tests” (The Standards for Educational and Psychological Testing, 2014, p. 11). Even if the statistics yield a result, the use and acceptance of vertical scales — even if only the suggestion of a vertical scale with NAEP — shows how little consideration psychometrics gives to validity.

I suppose that there’s a third takeaway, though it is less far-reaching. Matt’s question about NAEP scores has long since been addressed. In 2012, David Thissen wrote about the question of NAEP and vertical scales: “The conclusion of this essay will be that evidence can and should be assembled to support, and make more precise, interpretations of the first kind (“one year’s growth”), while interpretations of the second kind (cross-group comparisons across four-year spans) should be discouraged.” This work was done under contract with the publishers of NAEP, and yet they have taken up neither of his suggestions. They should do better.