ChatGPT Results May Be Plausible, But They Are Not Credible

ChatGPT — the new artificial intelligence chatbot that is all the rage — is amazing. It writes plausible text that seems fairly informed about the world and any number of topics. It is so impressive that all kinds of people are saying that this sort of approach could replace search (e.g., Google) when looking for information.

Wow, is that a bad idea. An incredibly bad idea. Just dumb. Really, really, really dumb.

First, and least importantly, I find it shocking how impressed so many people are by text written at the level of a very talented 8th grader. That is, the writing of a really smart 14-year-old.

Second, and far more importantly, it appears that people do not understand the roles of plausibility and accuracy in what ChatGPT is doing, or the implications for the kinds of things they might search for.

ChatGPT does not care about accuracy at all. That’s not part of its programming. That is not how it was designed, and not how its designers want it to be evaluated. If you care about accuracy, you need to steer clear of such things.

Instead, ChatGPT cares much more about plausibility. Obviously, it’s just a program, so it does not actually care. And I’m not sure that its designers would use the term “plausibility.” But I am fairly certain that they would concede that it is part of the goal. ChatGPT is built upon a vast and broad corpus of text and generates responses that fit the patterns its machine learning has found in that corpus. That is, it generates text that looks rather like what it has already seen, building on what the user types on their side of the chat.
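To make that concrete, here is a minimal sketch of the idea, assuming Python: a toy bigram (Markov chain) generator that produces statistically plausible continuations of a corpus, with no notion of truth anywhere in it. ChatGPT is a vastly larger neural network, not a bigram table, so this is only an illustration of pattern-driven generation; the tiny corpus and the names here are made up for the example.

```python
import random
from collections import defaultdict

def build_bigram_model(corpus: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    model = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word].append(next_word)
    return model

def generate(model: dict, seed: str, length: int = 20) -> str:
    """Emit a plausible-looking continuation by sampling observed successors.

    Nothing here checks whether the output is true, only that each
    word pair has been seen before in the corpus.
    """
    word, output = seed, [seed]
    for _ in range(length):
        followers = model.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

corpus = "validity and reliability matter in test design and test items"
model = build_bigram_model(corpus)
print(generate(model, "test"))
```

Run against any text, it will happily emit fluent-sounding word sequences; at no point does anything check whether those sequences are accurate.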

So, I gave it another sort of test run. I asked it about my work: about the differences between validity and reliability and how they apply in the development of items for standardized tests. Not surprisingly, it started with a bunch of very generic ideas about reliability and validity, focusing more on reliability. Partway through the chat, it started talking about classroom assessment, and I tried to redirect it again.

Eventually, it recommended some books — or were they articles? Well, either way, they were focused on classroom assessment. So, I asked about standardized test development. Most of what it recommended was more about psychometrics. But one of the recommendations seemed on point! It was a book (or article?) that I’d never heard of.

"Constructing Effective Test Items" by Susan M. Brookhart

Fascinating. Let me see if I can find that. I’d love to know more about it.

  • Google found nothing. (Well, it found two PDFs that do not mention any such piece, though all of those words do appear. There’s a reference to another piece by SM Brookhart. Well, co-authored by SM Brookhart.)

Google reports no “great matches.”

  • Google Scholar found nothing. (Well, it found the same thing, citing it as by TP Hall, when that is actually just the location (i.e., Tate Page Hall 334) of the meeting whose agenda the PDF reproduces.)

Google Scholar reports one result, but it is just a meeting agenda.

  • Google books did not find it, either.

No results found for "Constructing Effective Test Items" by Susan M. Brookhart

Google Books reports no results.

  • Bing offered more results, but none of them point to the article. Rather, they point to Dr. Brookhart’s website, her ASCD page, her ResearchGate page, etc. There’s also a link to something by a Tamanna Kalim over at Bangladesh Education Article. Now, it is not the worst article in the world, but it is not by Dr. Brookhart and is relatively generic. (A scripted version of this sort of existence check is sketched below.)
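For the curious, this kind of existence check can also be scripted. Below is a minimal sketch, assuming Python and the public Crossref REST API (which indexes DOI-registered works). It is only a sanity check, not an authoritative verdict, since plenty of legitimate books and articles are not registered with Crossref; still, a well-known published work usually turns up.

```python
import json
import urllib.parse
import urllib.request

def crossref_title_hits(title: str, rows: int = 5) -> list:
    """Query the public Crossref REST API for works whose titles match.

    Returns the titles of the top matches so a human can eyeball them.
    Absence from Crossref is not proof of nonexistence, since Crossref
    only indexes DOI-registered works.
    """
    query = urllib.parse.urlencode({"query.title": title, "rows": rows})
    url = "https://api.crossref.org/works?" + query
    with urllib.request.urlopen(url, timeout=10) as response:
        data = json.load(response)
    return [
        item["title"][0]
        for item in data["message"]["items"]
        if item.get("title")
    ]

for hit in crossref_title_hits("Constructing Effective Test Items"):
    print(hit)
```

If none of the returned titles matches the citation you were handed, that is a strong hint to keep digging before you trust it.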

This means that ChatGPT just made up this book (or article) that it claims exists. Just made it up. Sure, it is plausible that she would write such a thing. But it does not exist. I am not saying that it is a bad book (or article). I am not saying that it is irrelevant or poorly named. I am saying that it does not exist.

Even when ChatGPT provides references, it cannot be trusted. It just makes them up — because that is what it is designed to do. Present something plausible, something that looks like what is out there — without any regard for accuracy.

Do not trust ChatGPT. It is just not credible.

Addendum

Five days later, I tried again. I had a slightly different conversation — one a little more focused on the challenges of alignment and false negative inferences (i.e., Type II errors). ChatGPT was quite confident.

Yes, if you have two test questions that include the same two skills or concepts, you can designate one of the questions as being focused on the first skill and the other as being focused on the second skill. This can be an effective way to assess both skills within the same test, as long as the questions are well-constructed and aligned with the assessment objectives. However, it is important to ensure that the two questions are not too similar, and that they are measuring the skills or concepts in different ways. For example, if both questions focus on the same mathematical concept but one uses a more advanced problem-solving approach than the other, it may be more appropriate to consider the first question as the primary focus and the second question as a secondary focus. This can help ensure that the test provides a balanced and comprehensive assessment of the test taker's knowledge and abilities.

We had gone back and forth a bit on this point, and so finally I asked, “Where can I learn more about this?” ChatGPT’s first bulleted response was:

Reading books or articles on the subject, such as "Test Development: A Beginner's Guide" by Ruth E. Newton and David E. Wright, or "Standardized Testing: A Practical Approach" by David M. Nichols and Richard C. Sweetland.

These two books do not exist.

2nd Addendum

This week’s Slate Money begins with a segment on ChatGPT, and Emily Peck quickly makes reference to this issue that I had shared with her.