I'm sorry, but you don't point out anywhere, in any way, that AI is objectively not predicting the next word, and that was supposed to be the selling point of this post :(
Hi! Appreciate the feedback, want to better understand - can you say more about what you were hoping to see in it?
I was expecting a technical, or at least qualitatively technical, argument in the direction that the title of the article promises. (But is this explanation really needed?)
Now all we have is people spreading this article and saying "see? i told you! it doesn't just predict the next word anymore" -- but there is nothing to support this claim.
I'm not saying that you are wrong -- I'm saying that you literally didn't say why you believe it doesn't predict etc.
Thanks for elaborating on it - it sounds like you're taking me as arguing "AI doesn't predict" rather than "AI doesn't *just* predict," but the former isn't really my thesis.
This is the section that gets most into the details on the literal claim of the title, as opposed to pointing out "the implications of the common 'just predicting' phrase are mistaken": https://stevenadler.substack.com/i/182666817/ai-is-no-longer-well-described-as-just-predicting-the-next-word
I'm sure you've read this though, and so if you believe there's nothing in it that supports the claim of the title, we may have to just agree to disagree! I think there's quite a lot in it that supports the title
I'm not convinced that you aren't also doing that
You are not convinced that I point in some way that AI is not predicting the next word? Rofl, you can do better than mirroring, pal. Also, you are right, because I believe that it does predict the next word.
No insulting people in my comments please.
One small issue I have: Technically the people who just want a bit to point at and say "that's the next-token prediction, it's still happening, it's right there" still have a bit to point at, and I don't think you really refuted that here. In fact, there are two bits!
There's still a next-token distribution getting sampled from at each token (a minimal sketch of that loop is at the end of this comment). I think the crux is: a distribution modelling what? Because it's certainly not trying to maximize the likelihood of any existing data set, not in modern models. It's an engineering artifact that gets pushed every which way by different training phases with different losses and rewards and KL penalties etc.
It *is* still very useful to know about the sampling loop, but it's useful in the same sense as "wheeled vehicle" is a good and accurate descriptor of both bicycles and wheeled excavators.
Also, there's still pre-training on a large text dataset. As far as I know, it hasn't gone away as such, it's just relegated to being the first part of a much longer training process. I think here the computer vision people have a very good term in "pretext task". The more post-training you do, the more you can say that the purpose of the pretraining was only to learn internal representations, and the final model's behavior can go arbitrarily far from that of the base model.
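To make the sampling loop concrete, here's a minimal Python sketch of it, under the assumption of a toy hard-coded bigram table standing in for the model - `NEXT_TOKEN_TABLE`, `next_token_distribution`, and `generate` are illustrative names, not any real library's API. The only point is the loop structure: at every step a distribution over the next token is produced and one token is sampled from it.

```python
import random

# Toy stand-in for a model's next-token distribution: a hard-coded bigram
# table. A real LLM conditions on the whole context, and its distribution is
# shaped by pre-training plus all the post-training phases discussed above.
NEXT_TOKEN_TABLE = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.5},
    "a":     {"cat": 0.5, "dog": 0.5},
    "cat":   {"sat": 0.7, "<eos>": 0.3},
    "dog":   {"sat": 0.7, "<eos>": 0.3},
    "sat":   {"down": 1.0},
    "down":  {"<eos>": 1.0},
}

def next_token_distribution(tokens):
    # This toy "model" only looks at the last token; a real model attends to all of them.
    return NEXT_TOKEN_TABLE[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)   # distribution over the next token
        choices, weights = zip(*dist.items())
        token = random.choices(choices, weights=weights, k=1)[0]  # sample one token
        tokens.append(token)
        if token == "<eos>":                     # stop at end-of-sequence
            break
    return tokens

print(" ".join(generate(["<bos>"])))
```

The loop itself looks the same whatever the distribution was trained to do - which is exactly why pointing at it says so little about what the model is modelling.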
Well said, appreciate you pushing on this. Especially the description of what the distribution is ‘aiming at’
There's so much noise generated by both the evangelists and the doomsayers. It's hard to get any attention with a nuanced view.
Thanks.
You're welcome! If you share this with anyone and they have interesting responses, I'd love to hear them
Very well articulated. I tend to make the exact same points, but with less patience and empathy. Grateful that you're modeling high-quality discourse.
Thanks so much - yeah I hope it’s clear to folks that I understand why they used to believe this, it’s just not correct anymore!
I think people like to reduce complexity to things they can understand. Their mental model of "next word prediction" is that the LLM runs through some probability table and out pops a word. I don't think LLMs ever did that. What they have always done is something closer to "document completion" - and this task is accomplished by emitting one token at a time. I have no evidence to back up my theory, but logically I can see earlier layers of the LLM forming very high-level abstractions of the remaining document structure - number of paragraphs, abstract intent of each paragraph, etc. And then only in the latter layers of the transformer does the model begin to crystallise out the abstractions at the document head, zooming into paragraph 1, then sentence 1, then finally word one. To think it would just jump directly to the first word in a 1000-word response is a pretty ridiculous thought imo.
You might be interested in checking out the Anthropic link from the footnotes, if you haven't seen it before! https://www.anthropic.com/research/tracing-thoughts-language-model
They do a bunch of investigative work on how LLMs were choosing tokens, back in the more-straightforwardly-predicting-the-next-token days
All it says, and all that has ever been true, is that LLMs output one word at a time.
So did Shakespeare, coincidentally.
This is a great article, the key insight of which is that just because models are trained by next-word prediction (when they are) doesn’t mean that the resulting model is just doing next-word prediction. More complicated dynamics may emerge in the service of optimization, a problem that is perfect for mechanistic interpretability to investigate.
Thanks yeah, it's hard to know exactly where to draw the line between "there is prediction of the next word happening, at *some* level" and "it's *just* a next-word predictor," but I think we're definitely past it now!
Absolutely agree !
Super thoughtful. Well done
Thank you! Appreciate you saying so
I especially like the part about the implications that come with the phrase "just predicting the next word". To many people, this implies that AI is not capable.
We need to consider what AI can do and will do, rather than dismissing it based on how it got there.
Thanks yeah well said - I think the most central word here might be "just," as in, it is reductive to think that there is _only_ word-prediction happening, or that somehow that means the impacts can't be concerning!
This is a thought-provoking article, thank you. I am curious about your thoughts on how much we can expect the success of an LLM on one class of problems to generalize to entirely different classes of problems. For example, as you state, frontier LLMs now excel at the IMO math problems. However, one could claim the success is because a huge effort was made to train the LLMs on exactly those kinds of problems, i.e. millions of variations of IMO-like math problems. An LLM hyper-trained on IMO problems may be no better at solving unrelated problems. A contrarian to your thesis would say: it’s kind of like chess AI thirty years ago - Deep Blue beat Garry Kasparov, but it couldn’t do anything else that it wasn’t specifically trained on.
You’re welcome! Yeah it’s a fair question to what extent the performance will generalize. One datapoint here is that the same system that won IMO Gold also won gold at the International Olympiad in Informatics, but it’s hard to know from the outside exactly what it means for these to be the same systems. Likewise it’s hard to know exactly how much data is necessary in a domain before it starts excelling. I think Deep Blue was way more hand-tuned for chess fwiw - there’s a reason it still took so many years to get AlphaZero, which could learn chess but also other games. Reasoning models seem more like that to me, in terms of their generality
Thanks for being sympathetic to concerns about anthropomorphizing LLMs, the liberal arts major in me appreciates that. Now please stop worrying about it and apologizing for it! It is becoming abundantly clear to close observers of AI from outside the tech world--as I would describe myself--that anthropomorphizing is likely the best current model to explain functions and capabilities, and is becoming more so as capabilities increase. My jaw dropped a little when I heard Zvi publicly say this: "I don't think people are anthropomorphizing it enough."
Which was funny to me, because every time I read an example of an LLM getting something like "20 pounds of feathers" wrong, it just made me think "oh, they are just like us..."
Yeah I struggle with the right amount to do here - some people do seem to hate it incredibly viscerally, and you basically lose their ear the second you do it, even if it’s not really central to the point. But I still think it’s useful! Alas
Sorry I'm late to the party, but this is an excellent piece. Sharing far and wide.
Also, I see what you did there with the URL.
Oh that’s funny, Substack has a sense of humor of its own I guess! All their doing
Thank you for providing an exploration of this rather than hot takes and predictions. It’s important to me that AI isn’t spoken of in absolute truths and that we examine, like you have, how it’s evolving.
I really enjoyed reading how you’re seeing LLMs evolve toward pragmatics - we’re no longer working with Steve Harvey and “survey says” but a little further into seeing the sentence in context and shaping meaning from there. I do think it’s getting better at seeing the bigger picture rather than seeing the discrete word choices as abstract LEGO blocks. RLHF is an interesting one for me; that’s still pattern matching, just on a pragmatics scale - you’re matching tone, and often trained empathy, helpfulness, etc., in the reinforced model. I’ve been looking at it in the customer service realm and it’s fascinating - thank you for digging deep here.
My only thought, purely from a linguistics and pedagogy perspective, is that putting language in the same bucket as maths doesn’t speak to the same process. I do get where you’re coming from - they aren’t completely separate, there is overlap - but they are different cognitive processes. With maths, you’re either right or wrong; with language, you can reason and explore the grey. For me, they’re very different ways the brain works and expresses itself, so being good at solving mathematics or chess problems isn’t an indication of language skill. Again, I’m not speaking from any perspective other than linguistics, so just a narrow perspective.
Such a great piece - thank you!
Yeah I think you're right about the difference of verifiability/correctness of math vs language in general. It's funny, that's part of why copywriting was the first GPT use-case to really take off - companies like CopyAI, Jasper, etc - because writing was so much more fault-tolerant. And in fact, the randomness of AI writing was a virtue for that use-case, because it was basically analogous to creativity!
That was a great read Steven. We need more explanations that make current systems legible without overselling or underselling. Yours does both: it’s legible, and it neither oversells nor undersells.
Thank you! Writing it was helpful for figuring out what I actually believe about this, so that was useful too!
Admirable effort educating the chatterati, even if I expect it’ll be futile. ‘Parroting the next word’ is a very strong and convenient metaphor, and people who think and speak in metaphors will stick to it against all reason. Ilya’s ‘who done it’ last-word-of-a-novel example generalises. It’s not the next-word prediction that’s new (that’s been done for donkey’s years), but the quality of the prediction. It’s the quality of the prediction that’s new - that the model gets it right much more than it gets it wrong.
I hope the path-finder metaphor has just enough oomph to it that people pick it up, but I’m very open to alternatives!
😊 Thanks, afraid I can't offer anything better. Your effort is valiant, and I hope it improves the public discourse. Revealed preferences are that people are using and trusting AIs, despite the perhaps hostile reaction from the elite that sees it as a threat to their position in society. We know the level of use because we see the revenue of the AI companies growing. People are very honest when they part with their cash. 😁
Epic article Steven.
Thank you! I hope it's helpful for conveying all that's changed.
I found it very insightful!
This post is doing great traffic, but it’s hard to tell where it’s coming from. If you came here from a link elsewhere, I’d love to know where from!