ChatGPT Is the Opposite of the Golem

The kabbalist Rabbi Judah Loeb, according to Jewish legend, created a Golem out of mud to protect the Jewish community of Prague in the late 1500s. Animated by the utterance of letters from hidden names of God, and finally by the word “EMET” (Truth) etched in the clay of its forehead, the Golem came to life. It could think, plan and act, but it differed from humans in one crucial way: it could not speak.

Within the last 10 years, my fellow practitioners of artificial intelligence (AI), in many ways the practical kabbalists of the 21st century, have created what we call “Large Language Models,” or LLMs. These exploded into the public consciousness in the last two or three years with the release of ChatGPT and similar systems that can answer questions, explain concepts and write pretty good short high school papers.

While at first glance, the Golem and ChatGPT may seem to be closely related, they are opposites. The Golem was mute, while ChatGPT speaks. The Golem could think, while ChatGPT, as I will show, does nothing that could be described as thinking. The Golem had Emet (Truth) etched on its forehead, while ChatGPT operates by probabilistic models of word sequences where truth cannot play any part. But like Golems, these LLMs are highly dangerous if misused. And I strongly believe that they are being misused. But before I can explain to you why this is true, you need to understand — in simplified form — how they work. This gets a little technical, and a little geeky, but only a little.

While chatbots are quite complicated, in essence they are surprisingly simple. First, a little bit of complicated stuff: Chatbots are what AI practitioners call “Large Language Models,” where “language model” is a technical term for a statistical model that, given a text so far, gives the probability of each word it knows coming next. “Large” here means both that the language model used billions and billions of words (to misquote Carl Sagan) to compute the probabilities and that the statistical model itself consists of billions and billions of numbers (called “parameters”).
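For readers who like a bit of notation, the definition above can be written as a single conditional probability; this is just a restatement of what any language model computes, not anything specific to ChatGPT:

```latex
P(\text{next word} = w \mid w_1, w_2, \ldots, w_n) \quad \text{for each word } w \text{ the model knows}
```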

Here’s the core idea, starting with a simple language game: Given the two-word sequence “upon a,” what word is likely to follow next? You would guess that the word “time” is very likely, with other nouns and adjectives possible, and words like verbs, prepositions and the like much less likely. Given the three-word sequence “Once upon a,” the word “time” is now extremely likely, with “midnight” (as in “midnight dreary”) very probable if the phrase “Edgar Allan Poe” appeared in the last 100 words of text.

Now how could we get a computer to do this? Easy! For the “upon a” case, just count every time that phrase occurs in a large collection of text; then count, for every word, say “time,” how often it occurs immediately after “upon a,” and divide the second number by the first. This is just the probability that “time” follows “upon a.” We could do this for every word we found in the text and would then have a good estimate of the probability of each word following “upon a.” It is then easy to write a program that picks a word at random to follow “upon a,” but with the rule that a more probable word is picked more often than a less probable word.
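Here is a minimal sketch of that counting recipe in Python; the tiny “collection of text” is made up purely for illustration, and a real corpus would simply make the same loop run longer.

```python
from collections import Counter

# A made-up toy corpus; any large collection of text would do.
words = "once upon a time there lived a king once upon a midnight dreary".split()

pair_count = 0          # how many times "upon a" occurs
followers = Counter()   # how often each word occurs immediately after "upon a"
for i in range(len(words) - 2):
    if words[i] == "upon" and words[i + 1] == "a":
        pair_count += 1
        followers[words[i + 2]] += 1

# Divide the second count by the first to estimate each word's probability.
for word, count in followers.items():
    print(word, count / pair_count)   # with this toy corpus: time 0.5, midnight 0.5
```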

And now we can easily build a program to generate text given a couple of starting words. To do this, we need to do the same counting and arithmetic we did above for “upon a,” but for every pair of words in our collection of text and for every single word that might immediately follow that pair. This will take lots of boring computation, but that’s what computers are for. So we start with two words, say “once upon,” and pick a word to follow them, with probable words after “once upon” most likely. Now we take the last two words, “upon a” if “a” is what got picked, and repeat. If the next word picked is “time,” we take the two words “a time” and generate another word, and so on. This is precisely a “language model.” (All of this, spelled out in a little more detail, but not much, has been early programming homework for one of my classes at Penn for the last 15 years or so.)
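A sketch of that generation loop might look like the following; the file name shakespeare.txt is only a placeholder for whatever large text collection is being used, and a polished homework solution would add more care, but the cycle of count, pick a likely next word, slide the window is the whole idea.

```python
import random
from collections import defaultdict, Counter

# Placeholder file name; substitute any large collection of text.
corpus = open("shakespeare.txt").read().lower().split()

# For every adjacent pair of words, count which words follow that pair.
table = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    table[(w1, w2)][w3] += 1

# Start with two words and repeatedly pick a next word,
# choosing more frequent followers more often.
w1, w2 = "once", "upon"
output = [w1, w2]
for _ in range(30):
    followers = table.get((w1, w2))
    if not followers:                 # this pair never occurred in the corpus; stop
        break
    next_word = random.choices(list(followers), weights=list(followers.values()))[0]
    output.append(next_word)
    w1, w2 = w2, next_word            # slide the two-word window forward

print(" ".join(output))
```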

How well does this simple language generator work? Well, if trained on everything Shakespeare wrote, using two-word sequences as above, we get texts like “This shall forbid it should be branded, if renown made it empty” and “Indeed the duke; and had a very good friend.” Not bad, but also not very good! But if we simply do the same thing with three-word sequences, we get a much better result: “It cannot be but so” or “Will you not tell me who I am?” If we go to four-word sequences, we get back almost the original Shakespeare text. But there is a high cost to models that use these longer word sequences. To get the probability of the next word after each two-word pair, we need a table for each of these pairs. How many pairs might there be? If there are only 5,000 different words in our texts, there can be 5,000 times 5,000 different word pairs — 25 million of them! For three-word sequences, we can have each of these 25 million pairs followed by each of the 5,000 words, or 125 billion! So now we see one meaning of “large” in “Large Language Model.”
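The arithmetic behind those numbers is easy to check, and it shows how quickly the tables blow up as the context gets longer; the 5,000-word vocabulary is just the figure used in the example above.

```python
# Possible contexts for a vocabulary of 5,000 distinct words.
vocab = 5_000
for length in (2, 3, 4):
    print(f"{length}-word contexts: {vocab ** length:,}")
# 2-word contexts: 25,000,000
# 3-word contexts: 125,000,000,000
# 4-word contexts: 625,000,000,000,000
```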

ChatGPT is based on this simple idea but adds to it a number of very powerful mathematical tricks. Among them: It uses methods that can actually look back thousands of words in computing probabilities, in part by ignoring word sequences that don’t make much difference to what follows later. It uses some very powerful applied mathematics that goes under the name “deep neural nets” to combine guesses over many different spans of text, so it can predict based on the structure of tens of words (“sentences”), hundreds of words (“paragraphs”) and thousands of words (“essay structure”). And how does it answer a question? It treats the question itself as the initial text, with a couple of words added by the system: “QUESTION <your question goes here> ANSWER.”
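That last trick, treating the question as the start of a text to be continued, can be sketched in a few lines; generate_continuation is a hypothetical stand-in for whatever model does the next-word prediction, and the prompt format ChatGPT actually uses is more elaborate than this.

```python
from typing import Callable

def answer(question: str, generate_continuation: Callable[[str], str]) -> str:
    # Wrap the user's question so it looks like the start of a question-and-answer
    # text, then let the language model continue it; the continuation is the "answer."
    prompt = f"QUESTION {question} ANSWER"
    return generate_continuation(prompt)
```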

Given just this much, it’s now easy to see why ChatGPT and the like often generate texts that are silly, wrong, “hallucinatory” and the like. (My favorite thing to do with ChatGPT is ask it to write biographies of people the Internet knows something, but not a lot, about, like colleagues at other universities. It invents wonderful false facts about them, usually of the most laudatory nature.) In fact, given what we’ve seen, how could ChatGPT do otherwise? It just predicts the next word of the text it is generating, over and over and over. So it follows that it can’t reason, and it certainly can’t do logic. ChatGPT is famously wrong when forced to do arithmetic, and we can see why: If given the text “2 + 2 =,” it will likely find enough examples of these four “words” followed by “4” that it will get it right. But if this is a word problem (“Alice had two apples, and then Bob stole one apple and Cheryl ate another; how many apples does Alice have left?”), it doesn’t stand much of a chance unless its training material happened to include lots of word problems with something like this template.
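The arithmetic point can be seen in a toy version of the counting approach: the exact phrase “2 + 2 =” can be looked up, while a reworded problem leaves nothing to look up. The tiny training text below is made up for illustration.

```python
from collections import Counter

# Made-up training text that happens to contain the literal phrase "2 + 2 = 4".
training = "2 + 2 = 4 . 3 + 3 = 6 .".split()

context = ("2", "+", "2", "=")
followers = Counter(
    training[i + 4]
    for i in range(len(training) - 4)
    if tuple(training[i:i + 4]) == context
)
print(followers)   # Counter({'4': 1}): the exact phrase was seen, so "4" comes out

# "Alice had two apples, and then Bob stole one ..." shares no memorized word
# sequence ending in the right number, so there is nothing for this method to look up.
```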

This understanding also makes clear why we cannot solve the problem by training an LLM not on everything, both true and false, that can be vacuumed up from the web, but only on material that is known to be accurate. The LLM doesn’t extract the factoids, true or false, that the text encodes, but rather just the probabilities of word combinations. Truth just doesn’t come into the equation.

For example, consider parts of ChatGPT’s response (the version of May 24, 2023) to my request to “write a biography of Mitch Marcus, the professor at Penn.” The response correctly identifies my field (“natural language processing and computational linguistics”), as well as correctly identifying some of my particular contributions. But it falsely says that “Dr. Marcus earned his Ph.D. in computer science from Stanford University and began his academic career at the University of Texas at Austin.” Why Stanford? It has learned from the probabilities of words in academic bios that the first paragraph of that sort of text will end, with high probability, with “has made significant contributions to … ,” and it starts my bio that way; but then, with high probability, this should be followed by a paragraph in which <name> “earned his/her Ph.D. … at” <university name>. So it generates text with that form and picks a highly likely phrase to follow “Ph.D. in computer science from” in academic bios. If it had seen text like this that started with “Mitch Marcus,” it would have included my name, since that continuation would then have had a slightly higher probability. For pretty much the same reason, it falsely generates “University of Texas at Austin.” Note that “true” or “false” doesn’t enter into this computation, merely which strings are more likely to be in the same locality of text as other strings.

And so we see that ChatGPT is the opposite of the Golem. The Golem could think and reason, but it couldn’t speak. ChatGPT can’t think and can’t reason, but it does a great job of speaking!

There is a big surprise here: Programs that merely predict the next word in this way, if trained on lots and lots of different text, generate text that looks like thinking much of the time. Why is this? The answer is complicated, of course. But the paper by Emily Bender et al. cited in the notes below points in the right direction, as they characterize these LLMs as “stochastic parrots,” where “stochastic” is just a fancy word for “using probabilities.” They argue that the reason programs that generate text seem to us to be reasoning is that human beings are naturally predisposed to assume that things that generate language (like us!) actually mean what they say. They write:

Our perception of natural language text, regardless of how it was generated, is mediated by … our predisposition to interpret communicative acts as conveying coherent meaning and intent, whether or not they do.[1]

Or, in simpler language, “If one side of the communication does not have meaning, then the comprehension of the implicit meaning is an illusion arising from our singular human understanding of language … .”[2] It is this illusion that leads the users of ChatGPT and other LLMs deployed in commercial products to assume that the text that ChatGPT returns to them is intended, is true, and means something. Human beings mean when they speak. They may be confused or inaccurate or lie, but they either mean to be truthful or they mean to deceive. They have intentions, either to be helpful or to deceive. LLMs, as we have seen, don’t mean, and, as they are currently built, cannot mean. Users of this technology are deceived by, as Bender et al. say, “our predisposition to interpret communicative acts as conveying … meaning.” To deploy technologies that deceive is wrong.

When AI practitioners and our immediate friends used the predecessors of ChatGPT for ourselves to generate story plots, or rewrite the weather report in iambic pentameter, or compose a midrash on a lightbulb burning out in the style of Midrash Rabbah (as a friend showed me), this was fun and a great magic trick. Deception as part of a magic act — when the users know they are being deceived and enjoy it — is OK. If the LLM occasionally generated a text that was silly or wrong or hallucinatory, we knew to ignore it. The development of this technology as a stepping stone to something better is not a bad thing, in my view. But deploying it in its current form is unacceptable.

As we now all know, LLMs are also destructive in that they create a serious moral hazard; they make it easy for high school students, for example, to hand in ChatGPT’s work as their own. In the words of the rabbis, they put a stumbling block before the blind. And in doing so, they also prevent the learning that was meant to happen through researching and writing that paper.

So, are there acceptable uses for this technology now? Yes, and they are already with us, although not broadcast on the news as a breakthrough in AI. Could a future form of this technology be appropriate for the uses we see proliferating now, uses that today are inappropriate? Yes, and researchers, I am happy to say, are already taking up the significant technical challenges that need to be overcome to create a morally acceptable form of this technology.

My oldest son, a high-level computer system designer, tells me that he uses LLMs every day to generate code that he then carefully reads over and repairs. The LLM provides him with a first draft of the code that often gets the easy stuff right and saves him lots of time, providing what friends of mine called a “Programmer’s Apprentice” when they tried to build one years ago. Similarly, I have recently read announcements of seminars and the like where an LLM was given some text about the seminar series, including many earlier announcements, and drafted an announcement that a human then checked over. LLMs are a wonderful tool for drafting the easy parts of lots of white-collar work, as long as the person who edits the output understands what the program is likely to do and takes responsibility for assuring that the final code does what was intended, or that the final text says what was intended.

To create a technology that can be used as ChatGPT has been sold to the general public will require at least some of the following capabilities: First, it should be clear when such systems fail to produce truthful output. Second, since all technologies fail sometimes, systems must be able to accurately explain why they present the evidence they do and to reference underlying documents that support what they say. Unfortunately, none of these capabilities can result from superficial modifications of the current LLMs. One way forward may well be to engineer a system out of many LLMs, each of which does a limited substep of the complete task, with other computational processes running on the output of each that can assure the accuracy of the substep that LLM performed. This may well work in computational contexts where checking a result is much easier than generating the result itself.
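One way to picture that suggestion is a pipeline in which each narrow LLM step is followed by an ordinary, non-LLM checking step that can verify or reject its output. Everything in this sketch is hypothetical: call_llm stands in for whatever model is used, and the checkers are conventional programs supplied for the particular task.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Substep:
    prompt_template: str            # how to ask the LLM to do this one substep
    check: Callable[[str], bool]    # a conventional (non-LLM) verifier for its output

def run_pipeline(task: str, substeps: list[Substep],
                 call_llm: Callable[[str], str]) -> str:
    result = task
    for step in substeps:
        draft = call_llm(step.prompt_template.format(input=result))
        if not step.check(draft):   # refuse to pass along output that fails its check
            raise ValueError("a substep failed its check; a human must intervene")
        result = draft
    return result
```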

Can we AI practitioners pull this off? I don’t know, but it will be fun to try. But in the meantime, we have a responsibility to do what we can to prevent the existing technology — a step on the road to an honest technology — from being misused as it is currently being misused. This article is a small down payment on my doing my part.

  1. Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623.
  2. Bender et al.
