What is the Purpose of Educational Measurement?

Well, that’s a bad title. I mean, I know what the purpose of educational measurement is. It is to report on the status of test takers’ proficiency with particular knowledge, skills and/or abilities (i.e., KSAs), and perhaps to report on improvements in their proficiencies (i.e., learning).

And it is to do so in quantitative terms. Not all educational assessment is about quantification, but educational measurement is. I accept that.

So, what I am really wondering about is the scholarly, academic and researchy field of educational measurement. This field includes professors and others at universities, vast numbers of professionals working in industry (i.e., in both for-profit and non-profit organizations), folks working in government departments of education (i.e., local, state and federal), and even solo practitioners (like me).

I am asking about the researchy stuff. I am asking about the purpose and goals in advancing the field of educational measurement. This is the stuff of academic journals and a variety of types of conferences. No, this is not the everyday work of teaching or developing tests. Rather, this is the most creative and intellectual part of the field, where the state of the art gets created and pushed further. Where the field learns, grows and advances.

What is the learning and growth oriented towards? What is its purpose? It’s a somewhat large field, so I suppose that there can be a lot of goals, depending on the particular interests of the researcher and grant makers.

So, let me come at this from another direction.

Since the original edition of the handbook Educational Measurement (Lindquist, 1951), we have seen huge advances around the world in educational attainment and equity. Simply vast. In the United States we have seen an incredible lowering of the dropout rate, even as we have created state standards and even raised those standards.

I am not questioning the contributions of educational measurement to those advances, at least not today. Rather, I ask whether the advances in educational measurement over the last 70 years have been important to those incredible advances in educational attainment around the world.

If they have, I would love to know how. And if they have not, which I strongly suspect is the case, why not? What have 70 years of advances in the field of educational measurement been for, if not improving education for students, for communities and for nations?

I would really like to know.

Distractors Matter: Manipulating Item Difficulty with Distractors

One might think that the main determinant of a multiple choice item’s difficulty is the set of KSAs that an item is targeting. One might think that item difficulty can be spotted through an examination of the stem (i.e., the item’s question or prompt). But one would be wrong.

The most important determinant of item difficulty is the distractors (i.e., the incorrect answer options).

An item without plausible distractors is going to be an easy item. That is, an item whose distractors can all be quickly and easily dismissed—even by those without the command of the targeted cognition—will be easy. We call this low bar for plausibility shallow plausibility or surface plausibility. Distractors must at least be shallowly plausible, and yet they often are not.

An item whose distractors are all both shallowly and deeply plausible will be a more difficult item. Deeply plausible distractors are those that require working through the item to dismiss, because they follow from mistakes in applying the targeted cognition.

The most difficult items often have distractors that are quite similar to the key (i.e., the correct answer option). They might differ in some subtle way from their corresponding key. They might rely on a minor point in a text to differentiate. They might be a good answer option, just not the best answer option. For example, they might not be false, and yet they might not contain as much truth as the key. Therefore, they might look like a good answer to a test taker who does not check all the answer options.

None of these possibilities involve changing the targeted cognition, the stem or the key. And yet, these different sorts of distractors or distractor sets can radically alter the empirical difficulty of an item.

Heck, distractors are so powerful that they can shift the meaning of the evidence that an item collects from the targeted cognition to some other KSA(s).

Distractors Matter: Answer Option Order and Cognitive Complexity

Multiple choice (MC) items are not merely questions with a bunch of supplied answer options. They do not operate like constructed response (CR) or open-ended questions. Unfortunately, too few people understand and appreciate how important answer options are to understanding how people respond to MC items.

While the famous Haladyna, Downing and Rodriguez list of item writing guidelines says that answer options should be placed in a logical order, it does not address the impact of that order on how test takers work through an item. Sure, answer options could be ordered by length, put in alphabetical order or some sort of chronological order. But whatever rule one follows, it can have unintended impacts on the cognitive path that test takers work down to come to their choice.

Simply compare the cognitive path of placing the correct answer (i.e., the key) in the first position or in the last position.

For items such as mathematical calculations, putting the key first allows the test taker to skip all the other answer options entirely. But if the key is the last answer option, the test taker must consider (and perhaps compare to their answer) each of the other answer options before recognizing the key at last.

But if the item is perhaps less black and white, the test taker might have to try to interpret and make sense of answer options, comparing each to their own thinking. If the key is in the first position, the test taker can quickly come to a sense that it matches their answer, select it and then move on. But if the key is last, the test taker has to figure out whether each distractor (i.e., an incorrect answer option) means what they are looking for, or whether it means something else. As they move through the list, they might lean a little bit more into a “Well, does it kinda mean the same thing?” sort of thinking.

Clearly, the order of answer options can impact how long it takes test takers to work through an item, and the sort of thinking they need to do—even without changing anything about any individual answer option. Different test taker strategies can also influence these, but distractors matter.

Obviously answer option order impacts how test takers respond to items, right?

What is "Disciplinary Arrogance?"

Some disciplines seem more arrogant than others.

By that, I mean that some disciplines seem more willing than other disciplines to take their toolbox and lens and apply them to problems that they were not created to address.

By that, I mean that some disciplines seem less aware than other disciplines of other related disciplines and their toolboxes and lenses.

By that, I mean that some disciplines seem more dismissive than other disciplines of the answers and discussions generated within other disciplines.

Obviously, no discipline is intrinsically arrogant or humble. The tools, lenses and filters of any discipline lack anything like arrogance or humility. To be honest, those are just the wrong traits to apply to a discipline.

But by disciplinary arrogance, I do not merely mean that some people are more arrogant about their discipline than others. I do not refer to individuals. Rather, I think that it is something cultural, something that exists within communities and social groups, is shared and is passed on to future generations.

Perhaps the most arrogant discipline is economics. Economists seemingly think that their toolbox applies to all problems and can generate useful—and perhaps even wise—answers to virtually any real world question. Heck, economists have even named their toolbox (i.e., “econometrics”) to make it easier for others to use.

Well, they actually took a bunch of statistical tools used across many disciplines and redubbed them collectively “econometrics.” Even when those statistical techniques are applied to data that is not economic in nature, economists still call it econometrics—as though they invented the tools.

Obviously, disciplinary humility would be the admission that the tools and lenses of a discipline are not the best tools to analyze a problem or situation. Once again, this is not a trait of the tools, but rather something cultural across the membership of a discipline.

Economics is not the only arrogant discipline. Clearly, arrogant disciplines perpetuate their attitude as novices are acculturated and educated into the discipline. Therefore, it is something that can be addressed or even moderated, were the field to believe it appropriate to do so.

But how likely is that?

The Ambiguity of Item Difficulty

In the world of standardized assessment, item difficulty is empirically measured. That means that it is not a reference to the conceptual difficulty of the KSAs (knowledge, skills and/or abilities) that the item draws upon. Nor is it a reference to how advanced those KSAs are thought to be.

Rather, item difficulty is measured. It is the percent of test takers (or field test takers) who answered the item successfully. The math is a little more complicated for polytomously scored items (i.e., items for which test takers can receive partial credit), but the same basic concept holds. The statistic, p, is simply the percent of available points that test takers earn across the population. 
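As a minimal sketch of this calculation (the function name and data shapes here are illustrative assumptions, not from any particular psychometrics package):

```python
# Minimal sketch of computing item difficulty (p) from scored responses.
# The function name and data shapes are illustrative assumptions.

def item_difficulty(scores, max_points=1):
    """p = proportion of available points earned across test takers.

    For dichotomously scored items (max_points=1) this is simply the
    percent of test takers who answered correctly; for polytomously
    scored items it is total points earned divided by points available.
    """
    if not scores:
        raise ValueError("no responses")
    return sum(scores) / (len(scores) * max_points)

# Dichotomous item: 7 of 10 test takers answered correctly.
p_dichotomous = item_difficulty([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])  # 0.7

# Polytomous item scored 0-4: test takers earned 10 of 20 available points.
p_polytomous = item_difficulty([4, 3, 0, 2, 1], max_points=4)    # 0.5
```

Note that a lower p means a harder item, so the statistic is really an "easiness" measure despite its name.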

This makes the calculation of item difficulty easy. However, it makes the meaning of item difficulty rather…ambiguous.

Imagine some chemistry or mathematics item with a low p. That is, a relatively more difficult item. What does that low p tell us?

  • Could it be that the item is drawn from later in the course sequence, so test takers have had less time to practice its notable KSAs and build upon them? Perhaps so late that some test takers’ classes had not covered it yet when they took the test?

  • Could it be that the item is from an early topic, but is a particularly nasty application of the KSAs? That is, an application requiring a greater level of skill or understanding with the KSA(s) to answer successfully, as opposed to a more routine application?

  • Could it be that the item combines such a variety of KSAs that it provides an unusually large number of opportunities to reveal test takers’ potential misunderstandings? That is, different test takers might fall short for a range of different shortcomings in their KSAs?

  • Could it be that the item has speededness issues? That is, the item takes longer to complete successfully than most items, leading many test takers to simply—and perhaps wisely—bail on it in order to use their time more efficiently.

  • Could it be a multiple choice item with a very tempting alternative choice? That is, a distractor that perfectly captures a very common misunderstanding of the targeted KSAs?

  • Could it be a multiple choice item with a different sort of very tempting alternative choice? That is, a distractor that perfectly captures a very common mistake that is not tied to the targeted KSAs?

  • Could it be a multiple choice item with yet another sort of very tempting alternative choice? That is, an unintentionally ambiguous distractor that many test takers read as a correct answer option, even though the test developers did not intend it to be correct?

  • Could it be a multiple choice item with the converse problem? That is, an unintentionally ambiguous key (i.e., intended to be the correct answer option) that many test takers read to be an incorrect answer option, even though the test developers did not intend it to be incorrect?

  • Could it be that the item presents unusual language to many test takers? That is, an item whose instructions are different from how many teachers explain that sort of task, such that many test takers are not quite clear on what is being asked of them?

  • Could it be that the item has unrecognized fairness issues? That is, an item that includes some sort of construct-irrelevant and culturally-based knowledge? For example, use of some language that is well known to urban item developers and test takers, but not to exurban or rural test takers (e.g., bodega, bike path). 

  • Could it be that the item targets KSAs that students often have more trouble learning or mastering? That is, the item’s low p is actually a reflection of the difficulty that students have in learning a particularly tricky or subtle lesson—something that is generally well known by teachers. 

Yes, some of these explanations suggest a poor quality item. Three of them are clearly items that should not be used, because they are bad items. Two others present debatable cases about whether they are bad items. I believe that one of those is a bad item, but the other is a question that the client would need to settle. But the other six explanations are not bad items. Whether they are appropriate for a test is a question of expert judgment that needs to be calibrated against the intentions for the test.

(And none of these explanations are about the different topic of cognitive complexity, though it is often conflated with item difficulty.)

So, you see, measuring item difficulty empirically is not sufficient to understand the item. Like all psychometric tools, it is not capable of providing test takers, students, teachers or policy-makers the kind of information that they need to improve instruction and/or educational outcomes for students. It does not even provide information about the capabilities of test takers (i.e., aid in criterion-based reporting). Rather, it is entirely oriented towards comparing test takers to each other (i.e., to aid in norm-based reporting), with shockingly little reference to the targeted constructs.

The misunderstood relationship between validity and reliability

The foundational psychometric mistake is that practitioners behave as though—and perhaps believe that—reliability is a sort of fertilizer for validity. That is, in practice the mistaken disciplinary view seems to be that reliability leads to validity. But that has the causal relationship backwards. In fact, validity leads to reliability. But validity is not the only factor that can lead to reliability, and that is where the problems come in. Efforts to increase reliability can be orthogonal to validity, or even come at the expense of validity.


Now Think?


There’s this moment that still looms large in my memory and in my current thinking from back when I was a sophomore in high school (i.e., back in the dark ages). I had just switched to a different French class mid-semester and I really liked my new teacher. This was my least favorite subject in high school, and the one I struggled with most. Madame asked a question of a student and while he was formulating his answer, she turned to the rest of the class and said, “Maintenant, pensez!” Or maybe it was, “Maintenant pensez, tout le monde!” Now think, everyone. I THINK she went on to say, in French, that she might call on anyone next.

This moment has stuck with me because it had not previously occurred to me that people might not be trying to figure out the answer for themselves while our classmate struggled. Weren’t we all already thinking? I thought that her reminder was funny, for being so unnecessary. 

Last week, I was listening to an Ezra Klein Show podcast with guest Ethan Mollick on how to use various AI and LLM tools today. They were talking about merely wanting a correct or plausible answer, as compared to the hard work of thinking through a problem. Ezra keeps mentioning from episode to episode that generating an early draft is about thinking, and having an AI or intern do the work could give him a draft but it wouldn’t help him to figure out what he thinks or how he should think or whether he needs to rethink something. Ethan remarked, “People who like thinking like thinking.”

This is, and has always been, a challenge in my work. This was true when I was teaching. This was true when I worked in IT. And it is true as a researcher and in my assessment development work. It most definitely is true as a coach. How do we get people who might not actually like to think to actually think?

What is our future, with expanded artificial intelligence tools? It would be great if they could take some of the drudgery from our plates, but it seems that many people hope that AI can do the thinking for them. It seems that many people think that others would rather not think either, and the best thing that AI can do for us is to take responsibility for thinking and then just give us answers, results and shortcuts. That is not what I mean by drudgery. Of course, I—and all of my favorite people—like to think.

I don’t see any gain for society by catering to people’s reticence to actually think. We need more thinking, not less. We need more care and deliberation in assessment development, and in most every other field.

Why We Don't Love ECD Evidence Statements

RTD is inspired by ECD, and we love the idea of thinking about item and test results as evidence—and so much that it implies. And yet, we do not love ECD’s structure of evidence statements (i.e., descriptions of what evidence of the targeted cognition might look like.)

The biggest problem that we see in ECD is that it calls for evidence, but does not offer any theory of evidence. Hence, RTD had to develop its own Small Theory of Evidence: the quality of evidence produced by an item or test is inversely proportional to its ability to produce or support Type I and/or Type II errors. That is, assessments and their items should not support false positive inferences or false negative inferences.

Unfortunately, evidence statements—often inspired by ECD—do not account for the quality of the evidence they describe. Yes, such traits or qualities in test takers’ work products could be evidence of proficiency with the targeted cognition, but is it actually strong evidence in this case? Or, in this case, is it instead evidence of some other cognition? For example, is it instead evidence that the test taker recognized that they could plug the answer options back into the equation to see which one worked (i.e., back solving) instead of evidence that they solved the equation using the targeted cognition?

Evidence is often merely suggestive, instead of being proof in itself. Evidence is often ambiguous, and even then it can be useful—to a limited degree. Evidence is rarely proof; instead, it really needs corroboration to disentangle the ambiguities it suggests. This is the continuum of evidence quality.

However, evidence statements do not acknowledge this ambiguity and are often confused with descriptions of proof of proficiency with the targeted cognition. Then, they understandably supplant the targeted cognition as assessment targets. Once that happens, Campbell’s Law kicks in. The evidence statement proxy replaces the underlying construct, and item developers target the proxy in whatever most convenient and efficient way they can. 

Efficiently targeting a proxy can improve reliability, but it comes at the expense of validity because the most efficient route to a proxy can be one that does not go through the actual construct. That is, the efficiency requirements of larger scale standardized tests hone that efficiency in addressing the wrong target, seriously degrading the validity of the inferences and decisions made based upon such an assessment. 

Evidence statements can help to identify potential evidence in a large volume of test taker work product, but that process then requires some other construct or procedure to evaluate that potential evidence for its actual quality. Alas, ECD does not offer that second structure, and test developers’ drive for efficiency can ride evidence statements to rather questionable levels of validity. Retrofitting the evidence statement structure to address this problem (i.e., what we call robust evidence statements) is cumbersome—likely beyond any practical use.

Thus, if evidence statements enable increasing reliability at the expense of validity, test developers need a structure that focuses on validity—on producing evidence of the targeted cognition. This is where RTD item logic comes in.

Who is to Blame for Test Results?

Not that long ago, we were cautioned by a very smart and thoughtful expert not to report—or perhaps even look for evidence of—misunderstandings or misapplications of the construct or targeted cognition of an assessment. They were concerned that doing so would have the effect of blaming the test taker (or student) for their lack of proficiency. We hear this idea from time to time, that tests should only report what test takers can do or what they do know.

We find this suggestion incredibly destructive of every meaningful purpose for an assessment, including informal assessments.

First, the most important thing that a real teacher can do is to recognize what a student misunderstands, figure out the nature of their misunderstanding and then provide guidance and support that get them to greater understanding—and even proficiency or mastery. No, mere lecturers and explainers do not have to do this, but that is the difference between a teacher and those far easier roles. Formative assessment is all about looking for those misunderstandings so that teachers can do this special part of their jobs. Assessments must be designed to help teachers with this, and that cannot be done without looking for evidence of those misunderstandings and misapplications.

Second, there is nothing in reporting shortfalls from desired levels of proficiency that assigns blame. We do not blame children for being physically short. We do not blame children for not being read to by their parents. We do not blame students for lacking eyeglasses, or for needing them. We do not blame any students for poor instruction, poor curriculum or a lack of appropriate classroom materials.

Yes, it is possible that some students have failed to study or do their homework, and perhaps most of them bear responsibility for that—but not even all of them. Yes, some students are responsible for not paying attention in class, but some distractions are beyond the ability of students to ignore (e.g., an ill family member).

There are so many reasons why a student or test taker might fall short of expectations or our desires for proficiency, and while some of them may fall at the feet of the student or test taker, most of them simply do not. Even disappointing shortcomings in the ease of learning particular types of things (e.g., my own klutziness and lack of straight memorizing abilities) are rarely something to blame students or test takers for.

This gets to the myth of meritocracy we see too often in education. Student success and accomplishments are driven by much more than student effort or even some conception of student ability. Parents, teachers and other influences bear so much responsibility for student successes (and shortcomings) that it would be insane to ascribe it merely to students’ own merit. Moreover, to the extent that there is some sort of innate ability level, it is not as though students earned that.

No, there is no blame involved in looking for evidence of or reporting shortfalls in student proficiency, just as students do not deserve credit for their very real accomplishments that are built upon their lucky advantages.

The Reading Wars and Kanji

Kanji is a writing system used in Japan and grounded in a very similar Chinese system. Rather than a small set of letters based on the sounds of words, it is a vastly larger set of characters based on the meaning of words. Students have to learn 2000-3000 characters in school, but there are upwards of 50,000 different kanji characters—though approximately half of them are more technical or otherwise largely confined to use in narrow contexts.

(Kanji characters can be combined to form a word, as the kanji for “Kyoto” is two characters, “capital” and “city,” because back when Kyoto was first written about, it was the capital of Japan. So, students must not only learn thousands of kanji characters, they must also learn the combinations used to write thousands more words.)

[Image: 38 words written in kanji]

Japan also has a couple of (phonetic-based) alphabets, each larger than our own. But those alphabets have not supplanted kanji. They may be used alongside kanji, but kanji is the foundation of reading and writing in Japanese. (I will stop giving different links for kanji, but it is a fascinating topic that is far far far more complex than I have suggested.)

Now, our reading wars and claims around the so-called (and oft misunderstood) science of reading really boil down to how much instruction should a) lean into the use of our phonetically-based alphabet to sound out words when reading or b) push students to the more advanced recognition of words when reading (i.e., sight reading). I think that it is pretty obvious that no educators actually advocate for purely-phonics-based instruction, just as none are against the inclusion of phonics; it is a question of where the appropriate balance is.

It occurred to me this week that Japan simply cannot have these reading wars. Their primary writing system simply requires pure memorization of thousands of kanji characters. There is no fallback of sounding out words written with a kanji character. There is no fallback of sounding out words written with multiple kanji characters. Yes, one can sound out words written in hiragana or katakana, but not words written in kanji.

I wonder what this does to opportunities for academic success for Japanese students. I wonder if the challenges of memorizing kanji—both for writing and for reading—explains how much studying Japanese and Chinese students are so famous for doing. And I wonder how much we could learn about reading and writing instruction that might inform our reading wars if we looked at reading and writing instruction in Japan and China.

The Exception that Proves the Rule

There are a handful of expressions that used to contain some real wisdom, but in being shortened have become so inane that they even contradict their original wisdom.

For example, the original expression was “Imitation is the sincerest form of flattery that mediocrity can pay to greatness.” (Well, actually that’s Oscar Wilde’s version. Ironically, the original idea came from someone else in a somewhat different form.) Wilde’s expression made clear that the imitator marked themselves as merely mediocre, simply for imitating. This is an enormously condescending insult—Oscar Wilde’s wit, you know. But the shortened modern version, “Imitation is the sincerest form of flattery,” takes away everything insulting and condescending about the original. It takes away the tension in Wilde’s construction and refashions it into a kind of sincerity that means something different. It excuses imitation as being a good thing—missing the moral valence of the word “flattery.” Heck, on the school yard it attempts to defang imitative mockery into some sort of compliment. Shortening it misses the point—and the wit!

The topical expression this month is “The exception proves the rule.” You see, that is not actually the real expression. The full expression—a legal explanation—is “The exception proves the rule in cases not excepted.” This idea is not the vapid suggestion that if there is a rule there must be exceptions, or that the existence of something that breaks a pattern serves to underscore the existence of a rule or pattern. No, that is all nonsensical.

What the original expression means is that if the law lists some exceptions, then there must be a rule that covers everything else. That is, even if the rule is not listed explicitly, the existence of the explanation of exceptions is enough to prove the existence of the real—though implicit—rule.

Section 3 of the 14th Amendment to the US Constitution reads:

No person shall be a Senator or Representative in Congress, or elector of President and Vice-President, or hold any office, civil or military, under the United States, or under any State, who, having previously taken an oath, as a member of Congress, or as an officer of the United States, or as a member of any State legislature, or as an executive or judicial officer of any State, to support the Constitution of the United States, shall have engaged in insurrection or rebellion against the same, or given aid or comfort to the enemies thereof. But Congress may by a vote of two-thirds of each House, remove such disability.

That last sentence is the exception. It explains how to create an exception to the general rule of prohibition. As so many have said, this sentence proves that no legislative action is needed to enforce the rule. In legal terms, the prohibition is self-executing.

What is ECD's View of Evidence?

However, we’ve come to realize that ECD actually comes up short when it comes to evidence. I do not mean that it lacks procedures there—after all, it lacks procedures everywhere. Rather, ECD has far too little to say about evidence itself.

How do we know what counts as good evidence? How do we recognize evidence? How do we avoid bad evidence?

ECD’s framework includes Evidence Statements (i.e., descriptions of what evidence of the claims might look like in action). And like Task Models and Domain Models, ECD does not explain what evidence statements should look like. That is left to the individual practitioners and/or project teams. But we’ve come to realize that there are some essential problems with this too vague view of Evidence Statements. 

Simple Evidence Statements have a number of significant weaknesses. This is why RTD suggests more robust evidence statements, and creating them in the context of strong Item Logic.

First, simple evidence statements often mistake an absence of evidence for evidence of absence. Whether the purpose of an assessment is summative or formative, identifying what students do not know and cannot do is at least as important as identifying what they can do. In fact, with formative assessment, it is even more important. This is not about being negative; rather, this is about being instructionally minded. It is important to be sure one does not confuse an absence of affirming evidence with the presence of disconfirming evidence.

Second, simple evidence statements are prone to Type II errors (false negatives). This is in part due to the absence of evidence problem, but it is also due to their inability to disentangle different causes for mistakes or errors.

Third, simple evidence statements are prone to Type I errors (false positives). That characteristic of student or test taker work could be present due to that particular knowledge, skill, and/or ability (KSA), but it might be because the test takers took an alternative path that did not depend on that KSA. 

Simple evidence statements likely work best in the context of some sort of portfolio assessment, in which raters are able to review a broader set of each student's or test taker’s work and look for larger trends and patterns. Taken together, the errors in that noisy data can cancel out and the signal of information can become apparent. This is really just a sample size issue; the noisier the data, the larger a sample size is needed.
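The averaging-out argument above can be sketched with a small simulation. The noise model and the numbers here are illustrative assumptions, not an RTD or ECD procedure; the point is only that average estimation error shrinks as the sample of work products grows:

```python
# Treat each piece of work product as a noisy observation of a test
# taker's true proficiency; averaging more of them shrinks the noise.
# Noise model and numbers are illustrative assumptions.
import random

random.seed(42)

def estimate(true_proficiency, noise, n_samples):
    """Average n_samples noisy observations of true proficiency."""
    observations = [true_proficiency + random.gauss(0, noise)
                    for _ in range(n_samples)]
    return sum(observations) / n_samples

# Average absolute error of the estimate, over 1,000 simulated test takers.
for n in (1, 5, 25, 100):
    errors = [abs(estimate(0.6, noise=0.3, n_samples=n) - 0.6)
              for _ in range(1000)]
    print(n, round(sum(errors) / len(errors), 3))
# The average error falls roughly with the square root of the sample size,
# which is why noisier evidence demands a larger sample of work.
```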

However, neither formative assessment nor large scale standardized assessment has access to such large samples of a student's or test taker's work for each assessment target. Therefore, robust evidence statements are needed.

Robust evidence statements must include information about the context in which the evidence appears. What sort of directions or instructions prompted the work? Did they specifically ask for this sort of evidence, or did they merely provide an opportunity to develop it? How much scaffolding was present? Did the task allow for alternative paths? ECD talks about the evidentiary argument, and the importance of the source of and context for evidence is well known in the law. Assessment should take those issues just as seriously.

The Value of Negative Information

One of the most famous Sherlock Holmes stories is Silver Blaze, the one in which he deduces the identity of the guilty party because a dog did not bark. Negative information is often worth as much as positive information.

There certainly have been times in my life, however, when I felt that those who judged me were more concerned with negative information than with positive information. They were more concerned with what I did not do than with what I did do, or more concerned with what I could not do than with what I could do. As a teacher, I certainly wanted to identify and celebrate what my students could do.

Positive information feels positive and perhaps celebratory. Negative information feels negative and perhaps even mean.

But negative information can be invaluable.

In some places, there is an explicit effort to focus on what students can do, rather than what they cannot do. This even gets to how some people and organizations talk about assessment. Rather than saying, “Assessments should identify what students can and cannot do,” they say, “Assessments should identify what students know and can do.” I really do understand this urge. I used to be a teacher and I cared deeply about my students.

And yet, educators often say that what they really need is high quality formative assessments. That is, they need assessments that help them to provide instruction and to identify where students need more instruction. Of course, formative use of assessment requires time to go back and provide additional instruction and support to students after the assessment (and therefore timely results). But it also requires negative information. It requires tests and items that highlight what students do not know and what they cannot do.

This means that items used in formative assessment must be incredibly careful about false positive results (Type I errors). They cannot provide alternative paths to a successful response that avoid use of the targeted cognition. They cannot be so unstructured that it is not even clear what KSAs a test taker used to produce a successful response. Nor can they focus on the integration of skills such that it is not clear which KSAs broke down when test takers failed to produce a successful response.

The kinds of activities that I would rather teach with and want my students to be able to succeed with are not likely to be very useful for identifying what they need further help with. Yes, many students’ shortfalls might be clear, but many students will be able to steer around their weaknesses and lean further on their strengths. This kind of compensatory approach actively hides the information that formative assessments are intended to uncover.

The backlash against standardized tests is based upon many ideas, but one of them is surely that standardized tests and their results can feel mean. Such tests are often designed to reveal shortcomings, deficits and lack of proficiency. Alternatives to traditional standardized tests, therefore, often focus on the kinds of activities that I would rather teach my students with and with which I want my students to be successful. Such tasks seem to have more potential to feel celebratory. But such tests simply cannot provide high quality information for formative purposes (and the information is not really valid for summative purposes either).

Formative assessment is just as demanding as summative assessment. It requires just as much skill and rigor to produce. Though we do not focus on formative assessment in our Rigorous Test Development (RTD) model of practice, just about everything in RTD applies to formative assessment as well as it does to summative assessment.

How and Why Plagiarism Matters in the Academy

Plagiarism is a very important issue in academia—far more important than in other contexts. This is a very different issue than copyright, which is about the law and perhaps money. Plagiarism is something else. 

Plagiarism is about using the ideas or the expression of ideas of someone else without crediting them for it. (I was taught long ago that it also includes the organization of ideas, but I have never seen that really developed.) It is not a matter of using someone else’s ideas or words; rather, it is using them in an uncredited fashion. The exact same behavior—even the same exact case—can be meaningless and harmless in other contexts, but a major violation in a scholarly context.

There are two reasons for this. 

First, academia is about what I call the scholarly conversation. This is where we build on the work of others, crediting them for their contributions and then extending, applying or refuting them. Because we are building on the work of others, they have already shown how, why, where and in what circumstances those ideas apply, what they put together to get there, and perhaps laid out some caveats and/or restrictions. We do not replicate all of that work ourselves, because they have already done it. Unless our specific goal is to replicate their work—in order to verify it, perhaps in a new context—we should not try to replicate it. Instead, we give our readers a shortcut, by letting them know where they can find that earlier work.

This allows readers to evaluate the validity of the foundations we are building upon by considering the credibility of those earlier scholars. There are people whom I respect so much that I would likely just accept their conclusions, without needing to go and investigate how they got there. There are people whose work I have previously found so problematic that I do not trust anything built upon it. But most scholars? Well, if I am unsure about the meaning or breadth or application of an idea, I might want to go and learn more about it. Citations to others’ work allow me, the reader, to evaluate the precursors to the work I am reading—and to do so in the fashion I choose. 

Second, citing those who came before allows me to evaluate the scholar and work I am reading. If I can see that they know that they are building on the work of these previous scholars, I can better be assured that they have already considered or taken into account the issues that those previous scholars raised. I can see that they are, in fact, building on those other people’s work. This means that they should not be making mistakes already warned against, retreading infertile ground, or simply doing more elementary work. If they show me that they already know that previous work, and how they are building on it, I can take them and their work more seriously.

More subtly, by citing previous scholars, I can usually see the disciplinary, methodological and substantive direction that a work is coming from. It helps me to understand the kinds of concerns that will be explored, the kinds of tools that might be brought to bear and the classes of themes that might be recognized. That is, it gives me notice of what schema I should be activating so that I can more easily make more sense of what I am reading and will get to in this work. 

Now, both of these reasons matter enormously to those with the expertise to recognize and understand the citations in a work. They can seem like minor things to those who couldn’t make use of the citations for these sorts of purposes. But academic writing is aimed at just such an audience of experts. Obviously, this serves as a barrier to the larger public’s understanding of academic work. This is why writing for the broader public is so very different from writing scholarly works. But that is a different audience, and different audiences should be approached differently.

Clearly, neither of these two reasons really addresses the importance of correctly indicating when someone else’s words are being used. That is mostly just about politeness. But there is value in clear and/or efficient expression of an idea. We ought to give credit, rather than steal credit, for well crafted explanations of ideas. But in student work, including doctoral dissertations, there is another very important reason to be a stickler for properly crediting the expression of ideas. You see, explaining something in your own words is often how you show that you actually understand what something means, or why it is important. This is why, when quoting an extended passage—even the best written one—it is still important to explain its significance. Yes, this helps the reader to pick out the parts you mean to build on, but (perhaps more importantly) it shows the reader that you actually understand what you are referring to.

Because scholars in the academy must be stringent on this issue with their students, it becomes incumbent upon them to model in their own work the behavior they expect from students. Even a mild paraphrase can be introduced with “As [scholar] explained,…” And, honestly, I would feel taken advantage of if someone took credit for my phrasings (which I am sometimes proud of). I am so accustomed to giving credit to others in the scholarly community that I expect others to do the same with me.

With all of these reasons to correctly indicate the sources of the words in a piece and the sources of the ideas that the piece builds upon, why not give proper credit and correctly indicate quotations? I can hardly think of a respectable reason, leaving just laziness and sloppiness—which are hardly decent excuses.

However, I would add that non-experts might not recognize when an expression is really just a standard way to explain an idea. In fact, most of my quantitative methodology classes focused a shocking amount of attention on how to explain in words what quantitative data, results and/or analyses signified. This was taken so seriously that if two of us in the same class were given the same data, graph or statistical output, we could very easily independently write the exact same sentences to describe them, and our peers (and other experts) would immediately recognize what is—if not an essentially boilerplate sort of language—the style that a particular group has been intentionally acculturated into using. I wish that my qualitative methodology courses had been as careful about steering us clear of overstating or misstating what our data showed.

I would also add that this blog is not written in a scholarly fashion or for a scholarly audience. While I sometimes write with lots of citations, I do that much less here. Different form for a different audience, with different expectations. However, I try hard to attribute quotations properly, even here.

Cognitive Complexity: Uncertainty and Deliberation

While cognitive complexity can describe many things, the RTD approach to cognitive complexity is firmly grounded in the assessment industry’s dominant model, Norman Webb’s Depth of Knowledge (wDOK). As we read it, the central thrust of wDOK is the continuum of deliberation-to-automaticity, with the greater cognitive work of more deliberative cognitive paths being more cognitively complex, and the lesser work of greater automaticity—often earned through practice and greater proficiency—being less cognitively complex. (No, this is not the only way to think about cognitive complexity, but we based our rDOK approach on wDOK because it is so dominant in the industry. See our writing on rDOK (revised Depth of Knowledge) to examine how we think this plays out in the various content areas.)

One of our colleagues, a former science educator and now science assessment expert, wisely asked about the relationship between uncertainty and deliberation. Well, there are many kinds of uncertainty, and not all of them are tied to the kind of deliberation that DOK is about. Nonetheless, uncertainty often does lead to greater deliberation and a more cognitively complex path.

  • There is the uncertainty of not even knowing where to start, or whether to start. That is not deliberation. That is just indecision—often paralyzing indecision. It is a general, and common, nervousness that can be a barrier to focused effort. Teachers and tutors are familiar with this and an important part of their role is to help their students to develop the confidence to overcome this kind of uncertainty and take that first step.

  • There is the uncertainty of lack of confidence in one’s execution, which can be entirely rational. Perhaps more people should have this, as it leads to various sorts of proofreading. That is, they review their work for little mistakes in execution, even though this does not include rethinking the whole approach they took. Math teachers say “Check your work,” meaning the mathematics equivalent of proofreading. This uncertainty is not advanced deliberation, and the greater work it prompts is not indicative of great cognitive complexity. Rather, it is essentially repetition of earlier work.

  • There is the uncertainty of not being sure what to do next when in the middle of the problem, or even not being sure what to do first. That is, once past the paralysis that keeps one from even being able to truly try to make sense of the task, one might still be unsure about the first step. This question of “What do I do next?” is a form of deliberation. It can be answered simply by trying to remember the next step in a (perhaps poorly) memorized procedure. It might instead be answered by trying to (re)discover or (re)invent a good next step. This latter response constitutes reasoning and the kind of deliberation at the focus of both wDOK and rDOK. Indeed, uncertainty is often what creates the opportunity for deliberation. 

  • An even more careful deliberation can be prompted by initial uncertainty. One might try to figure out more than just the initial step, instead trying to work out a longer plan before diving into the work of the first step. This is not necessarily a different kind of uncertainty than mentioned above, but one’s response to it can be less or more careful and deliberative—and therefore less or more cognitively complex.

  • There is also a second kind of uncertainty after completing a task. One might ask oneself, “Was that even the right thing to do?” and revisit/question the reasoning that led to the steps taken. This differs from merely proofreading/checking one’s work, though both are prompted by uncertainty after the fact. Proofreading revisits execution, whereas this revisiting of reasoning is more cognitively complex.

Of course, one might be uncertain before a task and i) carefully develop a plan to help break through initial paralysis, ii) execute the plan, iii) revisit the reasoning of the plan but decide it was a good approach, and iv) then check one’s work. Uncertainty can drive all of this. All of that careful deliberation can still lead to bad plans poorly executed with errors that were missed when proofreading. No amount of deliberation can guarantee success, and the highly proficient can often achieve success without any conscious deliberation.

Uncertainty can be a product of a range of factors. It might come from genuine ignorance or other lack of necessary skills. It can come from insufficient practice or arise from being faced with novel situations. It can be a product of poor instruction or of a lack of effort to learn by a student. Some people are by character more confident and some are by character less confident—and either may or may not be justified in this. But regardless of the source of uncertainty, the question of cognitive complexity (i.e., either rDOK or wDOK) is answered by looking at the response to uncertainty.

On the other hand, certainty obviously inhibits deliberation. It makes deliberation of any sort far less likely, which is often detrimental to producing high quality work. Ideally, intellectual humility would put an upper limit on certainty and a lower limit on deliberation of various sorts.