Why I’m writing this
A few weeks ago, while sitting around a campfire (and being destroyed by mosquitoes), my friends and I attempted to define what it means to “think”.
I began contemplating whether what an LLM does could be called thinking. Although I would’ve loved to discuss that with my friends, they are not familiar with how an LLM works. And I certainly did not want to be the guy talking about embeddings, self-attention, and autoregression.
If I were, however, to try to explain how an LLM generates responses, I’d go about it like this.
Predicting the next word
LLMs (like ChatGPT) simply predict what the next word in a sentence is.
Let’s say you ask the LLM “How does a car work?”. Here’s how the answer gets generated1:
The LLM generates, not the full answer to your question, but just the first word of the answer.
After seeing billions of English texts during training, it has learned that the answer to a question like “How does a car work?” would probably start with “A”. So it returns “A” to you.
At this point, the LLM doesn’t know what the rest of its answer is going to be; it just thinks that “A” is a good way to start2.
Then it is given your question again, along with its previous prediction, and it predicts the next-next word, which is “car”. Then the next-next-next word, and so on.
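To make the loop concrete, here is a toy sketch in Python. The `predict_next_token` function is a made-up stand-in for the LLM (not a real API), and its canned answers exist only to show the mechanics:

```python
# A toy version of the generate-one-word-at-a-time loop described above.
# `predict_next_token` is hypothetical: it stands in for the LLM and just
# returns a canned "most likely next word" for this one example.
def predict_next_token(text: str) -> str:
    canned = {
        "Question: How does a car work? Answer:": "A",
        "Question: How does a car work? Answer: A": "car",
    }
    return canned.get(text, "<end>")  # "<end>" means the model thinks it's done

prompt = "Question: How does a car work? Answer:"
while True:
    next_word = predict_next_token(prompt)
    if next_word == "<end>":
        break
    prompt = prompt + " " + next_word  # feed the model's own output back to it
print(prompt)  # Question: How does a car work? Answer: A car
```

Everything interesting happens inside `predict_next_token`; the loop around it is all that the word-by-word generation amounts to.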
Well, not just the next word
I need to make a clarification: I’ve been saying “LLMs predict the next word”. But that’s not the whole truth. The LLM doesn’t just tell you the next word. It returns how likely it is that each word it knows will be the next word in the sentence.
A visualization will make more sense:
In the first step, the LLM is saying “I think there’s a 43% chance the next word is “A”, an 18% chance it is “The”, a 12% chance it is “It” and so on…”.
It is essentially a ranking: what’s the most likely next word, the second most likely next word, the third one, and so on.
In technical terms, it’s returning a probability distribution over all possible words in its vocabulary. We’ll come back to vocabulary later.
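To make the ranking concrete, here’s a tiny sketch with made-up numbers (the same illustrative percentages as the visualization above; a real model assigns a probability to every token in its vocabulary):

```python
# A made-up probability distribution over a tiny vocabulary.
# Real models do this over tens of thousands of tokens, not five words.
next_word_probs = {"A": 0.43, "The": 0.18, "It": 0.12, "Cars": 0.08, "dolphin": 0.001}

# The "ranking" is just this distribution sorted from most to least likely.
ranking = sorted(next_word_probs.items(), key=lambda kv: kv[1], reverse=True)
for rank, (word, prob) in enumerate(ranking, start=1):
    print(f"{rank}. {word}: {prob:.1%}")
```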
Pick your poison (aka: Top-k)
When the LLM responds with its ranking, you can choose which word to pick. If you always pick the most likely word (ranked 1st), you get (in a way) the most likely response to your question3,4.
However, you could pick one of the top 3, or top 5 (or top-k) words as well. The lower the rank of the words you pick, the more unpredictable the answer becomes. Here is an example you can play with:
There’s a point after which the response is just gibberish. And this makes sense! As the top-k increases, more unlikely words can get added to your sentence, making it more unpredictable. The probability of “dolphin” showing up in a sentence describing a sunset is low, but not zero. Only when you set top-k=20 is the unlikely word “dolphin” able to make it into your sentence.
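For the curious, top-k sampling itself is only a few lines. Here is a minimal sketch, assuming we already have the model’s probabilities for the next word (same made-up numbers as before):

```python
# Minimal top-k sampling over a next-word probability distribution.
import random

def sample_top_k(probs: dict, k: int) -> str:
    # Keep only the k most likely words...
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words = [word for word, _ in top]
    weights = [prob for _, prob in top]
    # ...and pick among them in proportion to their probabilities.
    return random.choices(words, weights=weights, k=1)[0]

probs = {"A": 0.43, "The": 0.18, "It": 0.12, "Cars": 0.08, "dolphin": 0.001}
print(sample_top_k(probs, k=1))  # always "A", the most likely word
print(sample_top_k(probs, k=3))  # one of "A", "The", or "It", weighted by probability
```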
The key is to find the balance for your use case. If you are writing a legal document, a very sensible and reliable response is best. So picking the most likely word (which is known as top-1, or temperature of zero) is a good choice. But if you are writing a poem, you’d benefit from more randomness (which can come off as creativity) so top-5 might be better.
This one little setting can significantly change the personality of the LLM! When you use ChatGPT, OpenAI picks this setting for you5.
Actually, not even words
An important confession: LLMs don’t predict words at all. In fact, LLMs probably don’t even know what words are. Instead, LLMs predict tokens. Tokens are like pieces of words; think of them as syllables. I am introducing tokens now, rather than earlier, to keep the explanation digestible.
This is what it actually looks like:
When I first learned about tokens, I thought it was kind of stupid. Why would an LLM predict pieces of words instead of complete words? It seemed overcomplicated. But it turns out there are good reasons.
To LLMs, there’s nothing special about words; they’re just letters strung together. Tokens are also just letters strung together; they just happen to be shorter. Humans have a special attachment to words because they carry meaning for us, while syllables don’t. But for an LLM, it’s all the same.
So why did researchers choose to build LLMs to operate over tokens instead of whole words?
A big reason is vocabulary. Recall from earlier that LLMs predict the next word. But how does an LLM even know which words it can predict in the first place? Well, when the LLM is built, it is given a vocabulary: the set of words it can use. When it predicts, it ranks all the words in its vocabulary from most to least likely6.
If we want LLMs to predict actual words, then the LLM’s vocabulary must contain every possible word. That’s almost a million entries, including rare words like “deoxyribonucleic”. Now the LLM will have to rank how likely it is that “deoxyribonucleic” is the next word in a sentence every single time it makes a prediction. This is a huge waste of resources since “deoxyribonucleic” is rarely used.
Instead, we can give it pieces of words, and it can combine these pieces to create whatever word it needs. Take “unbelievable” as an example. If we split it into chunks: “un”, “be”, “lie”, “v”, “able”, then we can construct “believable”, “able”, “unable”, “be”, “lie”, and more for free.
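If you’re curious what real tokens look like, here’s a small sketch using OpenAI’s tiktoken library (my choice for the illustration; the post itself doesn’t name a tokenizer). The exact splits depend on the tokenizer, so don’t expect them to match the chunks above exactly:

```python
# Print the token pieces a real tokenizer produces for a few words.
# Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by GPT-3.5/GPT-4-era models
for word in ["unbelievable", "car", "deoxyribonucleic"]:
    token_ids = enc.encode(word)                        # the word as a list of token IDs
    pieces = [enc.decode([tid]) for tid in token_ids]   # each ID back to its text piece
    print(f"{word!r} -> {pieces}")
```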
Since words and tokens are all the same to LLMs, it’s more efficient to use tokens7.
So are LLMs thinking?
So now, think to yourself: are LLMs “thinking”? Who knows! What does thinking even mean? That answer probably lies in the realm of philosophy and consciousness. Perhaps a topic for another day.
What can be said, however, is that it’s fascinating that a process so simple produces such intelligent behavior. Now, I am certainly dumbing LLMs down; no part of hiring hundreds of PhDs, building a massive data center, gathering all available human text, architecting a complex model with billions of parameters, and then training it should be referred to as “simple.” But the thought that an LLM is merely predicting the next few letters at a time is still mind-blowing.
Footnotes
These are slightly more technical.
1. The system that sits between you and the LLM handles the loop of passing the LLM’s response back to itself. Remember, the only operation the LLM can perform is predicting the next token. Hence, there needs to be a system that feeds the LLM the sentence again and again until the LLM says the sentence is complete. The same system is what adds the “Question:” and “Answer:” labels (which I am using as an analogy to the System, Agent, and User labels in agentic systems).
2. “…it just thinks that “A” is a good way to start” is an ambiguous statement. I should refrain from using words like “think” or “know”. What do these words even mean in this context? If the LLM can give you a correct answer, did it “know” the answer? Does regurgitating the most probable sequence of tokens count as “knowing”? These are philosophical questions. I continue to use these words because they simplify layman explanations. I think being 20% less rigorous can make me 80% easier to understand (those numbers are fairly arbitrary). If I were fully pedantic, “it just thinks that “A” is a good way to start” should instead be “It computes that the probability that the answer starts with “A” is higher than the probability of all other tokens, assuming top-k=1”. You see how that’s not something my mom would be interested in understanding?
3. I used the phrase “most likely response” in a handwavy fashion, so it could mean many things. Within the realm of autoregressive models, here are two definitions I could think of (spelled out in the formulas below):
   - The response generated by greedily sampling the most likely token at each step.
   - The response that maximizes the total product of the probabilities of its tokens.
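   In symbols (my own notation, not from the post): write the prompt as $x$ and the response tokens as $t_1, \dots, t_n$. The second definition picks the whole sequence at once,

   $$\hat{t}_{1:n} = \arg\max_{t_1, \dots, t_n} \prod_{i=1}^{n} P(t_i \mid x, t_1, \dots, t_{i-1}),$$

   while the first (greedy, top-1) picks one token at a time, $t_i = \arg\max_{t} P(t \mid x, t_1, \dots, t_{i-1})$. The two don’t always produce the same response.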
4. If my understanding is correct, this is the only source of randomness in the LLM. If you always picked the most likely word, you’d get the same response every time.
5. Top-k is a very basic sampling approach. I bet OpenAI is doing something more complex under the hood. I think they may be setting the inference hyperparameters (if you can call them that) at run time, after seeing the first prompt, to guide the LLM’s personality to best serve the user’s request.
6. This ranking step isn’t actually done by the LLM. The LLM simply outputs a vector of probabilities; the system around the LLM sorts it. Again, keeping explanations 20% less rigorous to be 80% more understandable. This is how I learn best.
7. A good question at this point is: forget tokens, why not just use the alphabet? Then we can create any word, and the LLM only needs to memorize the alphabet. The problem is that the sequence to predict gets far too long, and since character-level tokens encode very little information, model performance drops. I should elaborate on this; for now, the sketch below gives a rough sense of the length problem.
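   A rough illustration of the length problem, again using OpenAI’s tiktoken library as the tokenizer (my choice for the example, not something from the post):

   ```python
   # Compare how many prediction steps a sentence needs at the character level
   # versus the token level. Requires `pip install tiktoken`.
   import tiktoken

   enc = tiktoken.get_encoding("cl100k_base")
   sentence = "How does a car work?"
   print("character-level steps:", len(sentence))              # one prediction per letter
   print("token-level steps:    ", len(enc.encode(sentence)))  # far fewer predictions
   ```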