Is AI smarter than a house cat?
Turing Award winner Yann LeCun recently answered this question with a resounding “no”.
Are LLMs (Large Language Models) smarter than house cats? Turing Award winner Yann LeCun recently answered this question with a resounding “no” in his testimony to Congress:
“A cat can remember, can understand the physical world, can plan complex actions, can do some level of reasoning—actually much better than the biggest LLMs.”
Like LeCun, many AI experts believe AGI will not be achieved purely by scaling up LLMs. Other researchers disagree.
Which group is correct remains an open question, but we can learn a lot from examining their arguments. If scaled-up LLMs do not reach the creativity of human scientists or mathematicians, then we should be skeptical both of promises that research will soon be done by AIs and of arguments that AI will “replace” human employees. In the LLM-skeptic version of the world, current AIs would be closer to a human-augmenting tool that makes people more productive than to some kind of super-human agent, and the world would change much less than many currently expect.
Why might LLMs fall short of AGI? One answer: embodiment
LLMs are trained on data from written texts — the internet, books, and other such sources. Common sense would suggest that while you can learn an astonishing amount from such sources, there are many things that LLMs cannot learn. Navigating the physical world in unknown situations is one example: Steve Wozniak’s famous AGI test is “go into a kitchen, without prior knowledge, and figure out how to make a cup of coffee.” Operating in the real world involves tacit knowledge, and not all of the resulting gaps can be plugged by written text.
This is part of LeCun’s argument, which seems to have two main parts:
1. Intelligence is grounded in the physical world. When you look at a cat or another animal, you can see it reasoning, planning, and pursuing goals at a level that even current-generation LLMs cannot reach. More generally, LLMs learn through text data, but humans learn through the world first and foremost, with language coming later; and the sensory input we receive contains much of the world model that is core to our intelligence. This ‘core’ of intelligence is pre-language, and learning only from text won’t capture any of it.
This would pose fundamental limitations for, say, LLMs understanding physics, or LLMs being capable of accomplishing complex goals in the real world. Their intelligence would be limited to ‘book learning’, and be brittle. (Consider the example of only learning psychology through reading peer-reviewed psychology papers, versus studying real humans.)
2. Intelligence involves understanding novel concepts and generalizing from limited amounts of data. This is disputed, but the current evidence suggests that LLMs do not do this as well as humans. For example, early-generation LLMs could multiply single-digit and double-digit numbers, but failed at multiplying numbers with more digits. This suggests they had not really learned the underlying concept of multiplication at the right level of abstraction.
As Tyler Cowen has recently pointed out, another example that trips up GPT-4 is “Name three famous people who all share the exact same birth date and year.” Even current LLMs cannot do this task well, typically getting the month and day right but the year wrong. The task does not require complex reasoning, so it’s puzzling that LLMs cannot do it.
Advocates argue that failures like this go away with scale. But although you can train away any particular failure by giving the LLM examples, the broader critique stands: what happens when LLMs encounter entirely novel situations for which they have no training data? Would they be able to reason and generalize about those situations? Examples like the above suggest they will get to “very good”, but not quite superhuman, at reasoning in novel situations. (A simple probe along these lines is sketched below.)
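To make the generalization worry concrete, here is a minimal probe that asks a model to multiply operands of increasing length and checks the answers exactly. The query_llm helper is hypothetical (a stand-in for whichever model API you use), and this is an illustrative sketch rather than a rigorous benchmark:

```python
import random
import re

def query_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("wire this up to a model of your choice")

def multiplication_accuracy(digits: int, trials: int = 20) -> float:
    """Fraction of exactly correct answers when multiplying two random `digits`-digit numbers."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = query_llm(f"What is {a} * {b}? Answer with the number only.")
        found = re.findall(r"\d+", reply.replace(",", ""))
        if found and int(found[-1]) == a * b:
            correct += 1
    return correct / trials

# The critique predicts near-perfect accuracy for 1-2 digit operands
# and a sharp drop-off as the operands get longer.
for d in (1, 2, 4, 8):
    print(d, multiplication_accuracy(d))
```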
There is some recent evidence for the embodiment point: Meta recently released the Open-Vocabulary Embodied Question Answering (OpenEQA) benchmark, which measures an AI agent’s understanding of physical spaces via questions like “Where did I leave my badge?” [2]. Their conclusion is that even the best Vision-Language Models are “nearly blind”: “models leveraging visual information aren’t substantially benefitting from it and are falling back on priors about the world captured in text to answer visual questions”.
For answering questions such as “which room is directly behind me”, they found the models were more or less guessing at random, rather than using their physical memory to reason about the space, as humans would. In other words, even models that can “see” are bad at interpreting what they are seeing and building a mental model of it that they can use to reason.
This supports LeCun’s position: cats are, in some important sense, more intelligent than current-generation LLMs, and fundamental improvements on the perception and reasoning fronts are required. Whether scaled-up LLMs will get there remains an open question.
Anti-skepticism
It is worth noting that LeCun is not skeptical about AGI in general; when asked when AI will actually surpass human intelligence, he said, “Probably more than 10 years, maybe within 20.” [1] This is still relatively soon. Moreover, many of LeCun’s arguments are not definitive, and these remain open questions.
Here are some reasons why LLM capabilities might be more expansive than he gives them credit for:
First, it is not entirely clear where, exactly, the boundaries of LLM capabilities will lie. Predicting that LLMs cannot do X usually goes poorly, and part of the surprise with modern LLMs is that they have emergent capabilities that weren’t a priori predictable or obvious from the training data, and that only emerged with sufficient scale.
Reasoning is one example: it seems clear that LLMs can, in fact, reason. Chess provides evidence for this: GPT-4 can play chess at about 1800 Elo, around the 90th percentile for a rated human chess player. This includes playing good moves in board positions it has never seen before. Yes, chess games were included in GPT-4’s training data, but the fact that it handles unfamiliar positions well suggests it has developed a decent internal ‘chess engine’ that generalizes, which in turn suggests that GPT-5 and GPT-6 are likely to be even better chess players.
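One rough way to test the “never seen before” claim is to scramble a game into an improbable position and check whether a model still proposes legal moves. The sketch below uses the python-chess package together with a hypothetical query_llm helper; it only checks legality, which is a far weaker bar than playing at 1800 Elo, but it illustrates the kind of out-of-distribution probe the argument relies on:

```python
import random
import chess  # pip install python-chess

def query_llm(prompt: str) -> str:
    """Hypothetical helper: return the model's reply as text (here, a UCI move)."""
    raise NotImplementedError

def random_position(plies: int = 24) -> chess.Board:
    """Reach an improbable but legal position by playing random moves from the start."""
    board = chess.Board()
    for _ in range(plies):
        if board.is_game_over():
            break
        board.push(random.choice(list(board.legal_moves)))
    return board

board = random_position()
prompt = (
    f"You are playing chess. The current position in FEN is: {board.fen()}\n"
    "Reply with a single legal move in UCI notation (for example, e2e4)."
)
reply = query_llm(prompt).strip()
try:
    move = chess.Move.from_uci(reply)
    print("legal" if move in board.legal_moves else "illegal", reply)
except ValueError:
    print("unparseable reply:", reply)
```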
Second, “sampling can prove the presence of knowledge but not its absence” [3]: in other words, it is difficult to say an LLM cannot do something, because as of 2024 the right prompt still matters. You can ask an LLM to do something and get a wrong answer; phrase the request slightly differently, or offer to tip the LLM $2,000 for a correct answer, and it succeeds. Thus, although it is easy to show that a given prompt fails, it is much harder to show that every possible prompt will fail: this is another reason the boundaries of LLM capability are fuzzy. Larger LLMs may contain reasoning engines for all sorts of valuable questions, as long as we figure out how to correctly tap their power.
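Because a single failing prompt proves little, capability claims really need to be checked across many phrasings. A minimal sketch of that idea, assuming a hypothetical query_llm helper and a caller-supplied correctness check:

```python
def query_llm(prompt: str) -> str:
    """Hypothetical helper: return the model's text reply for `prompt`."""
    raise NotImplementedError

def capability_present(phrasings: list[str], is_correct) -> bool:
    """Sampling can prove presence, not absence: one success across many
    phrasings demonstrates the capability; failure on all of them proves little."""
    return any(is_correct(query_llm(p)) for p in phrasings)

phrasings = [
    "What is 17 * 24?",
    "Compute 17 multiplied by 24. Answer with the number only.",
    "You are a careful mathematician. I will tip you for accuracy. 17 * 24 = ?",
]
print(capability_present(phrasings, lambda reply: "408" in reply))
```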
One thing machine learning is excellent at is finding deep structure in domains that are unintelligible or too complex for humans. For example, LLMs have found the “deep structure” of language and grammar, which is why they are able to write perfect English prose; fully formalizing grammar is a task that eluded human linguists, and cracking it required linear algebra, large amounts of data, and large amounts of compute. Other domains that are ‘like language’ could be amenable to cracking in this way: biology is one candidate, since both genetics and the manufacturing of biological molecules have language-like structure.
Third, many of the examples where LLMs fail are word-related. For example, LLMs cannot play Wordle correctly. Even simpler requests, such as “list a set of individuals who share the exact same birthday and year” or “name all British Prime Ministers with a repeated consecutive letter in one or more of their names”, continue to produce incorrect answers. However, these failures are all related to the specific way LLM inputs are constructed (“tokenization”), and we should expect them to disappear eventually.
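The tokenization point is easy to see directly. The sketch below uses OpenAI's tiktoken package (assuming it is installed) to show that a model never sees individual letters for most words, which helps explain why character-level puzzles like Wordle or "repeated consecutive letter" questions are so hard:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer family used by GPT-4-era models

for word in ["Wordle", "bookkeeper", "Churchill"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    # The model operates on these chunks, not on letters, so counting or
    # comparing individual characters is never directly represented in its input.
    print(word, "->", pieces)
```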
High Crystallized Intelligence, Low Fluid Intelligence
Given the same data as Einstein, could more advanced LLMs come up with general relativity without being prompted to do so? This is an open question.
Going back to chess: neither GPT-4 nor Claude Opus can compete at chess with the best specialist chess engines. Even at significantly higher levels of scale, it would be surprising if they managed to beat AlphaZero-level engines, which are optimized specifically for such games.
Similarly in science: specialist applications such as AlphaFold have proven more important than LLMs, which so far have mostly been useful for routine tasks such as helping scientists write grant applications faster.
This is what you would expect from looking at how LLMs work: they are able to interpolate well in a vector space given a decent amount of training data. But for entirely novel domains, where there is no training data, it is unclear how or whether LLMs will be able to reason in a way that guarantees the answer is actually correct and exact, versus ‘merely plausible’. This suggests that even the LLMs that come at the end of the 2020s would not be able to come up with general relativity, given the data Einstein had.
Science, and advancing human knowledge more generally, involves reasoning, and sometimes that reasoning requires making extremely improbable choices. The famous AlphaGo vs. Lee Sedol match involved several moments of transcendent creativity from the AI Go engine, notably its move 37 in game 2, which was so creative that many of the human commentators initially thought it was a mistake. [4] The move turned out to be an exception to a general principle, one which AlphaGo had correctly reasoned did not apply in this particular board position.
AlphaGo could back move 37 because it was doing a form of search: calculating the consequences of the move and evaluating them. LLMs do not currently do this, but most researchers agree that search will be a key part of a future AGI – much as humans plan and reason about the consequences of their actions. Even when writing, one typically chooses between words and sentences based on how they fit into the whole; this is a form of reasoning that LLMs skip, since they generate text one token at a time. Projects such as OpenAI’s rumored Q* are going after this kind of broader AGI paradigm.
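To make “a form of search” concrete, here is a tiny exhaustive minimax search for a toy game (a pile of 21 stones; players alternately remove 1-3; whoever takes the last stone wins). It is obviously not AlphaGo, but it shows the calculate-consequences-then-evaluate loop that tree search performs and plain next-token generation does not:

```python
from functools import lru_cache

# Toy game: a pile of stones; players alternately remove 1, 2, or 3 stones;
# the player who takes the last stone wins.

@lru_cache(maxsize=None)
def value(pile: int) -> int:
    """+1 if the player to move wins with perfect play, -1 otherwise."""
    if pile == 0:
        return -1  # the previous player took the last stone, so the mover has already lost
    # Search: try each move and evaluate the resulting position from the opponent's view.
    return max(-value(pile - take) for take in (1, 2, 3) if take <= pile)

def best_move(pile: int) -> int:
    """Pick the move whose searched consequences are best for the player to move."""
    return max((t for t in (1, 2, 3) if t <= pile), key=lambda t: -value(pile - t))

print(value(21), best_move(21))  # the mover wins from 21 by taking 1, leaving a multiple of 4
```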
All of this implies that LLMs, which tend to ‘reason’ using a probability distribution over the next token, will be great at producing anything where ‘looks like a plausible answer that could be correct’ is the criterion. But in a game like Go, or in human engineering tasks such as building a plane or folding a protein, exactness and provable correctness, grounded in mental models and reasoning about the world, matter for the right result. An LLM will produce a plausible simulation of a wave or the orbit of a planet, but if you need the numbers to be exact or to represent reality, you would not use an LLM; you would use a specific model designed for that use case. It is plausible that many of the returns from AI will come from humans putting large amounts of effort into specializing such models for important tasks, such as designing molecules or understanding the genome – not just from large LLMs.
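As a contrast with a merely plausible orbit, a purpose-built model is only a few lines of explicit physics. This sketch integrates an Earth-like orbit around the Sun with the semi-implicit (symplectic) Euler method; the constants are standard, and every step is a checkable calculation rather than a guess about what orbits usually look like:

```python
import math

# Two-body problem: a small body orbiting the Sun, integrated with
# semi-implicit (symplectic) Euler, which stays stable over long runs.
G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30       # solar mass, kg
AU = 1.496e11          # astronomical unit, m

x, y = AU, 0.0         # start 1 AU from the Sun
vx, vy = 0.0, 29_780.0 # roughly Earth's orbital speed, m/s
dt = 3600.0            # one-hour time steps

for _ in range(24 * 365):                     # integrate for about one year
    r = math.hypot(x, y)
    ax, ay = -G * M_SUN * x / r**3, -G * M_SUN * y / r**3
    vx, vy = vx + ax * dt, vy + ay * dt       # update velocity first...
    x, y = x + vx * dt, y + vy * dt           # ...then position with the new velocity

print(f"Distance from the Sun after ~1 year: {math.hypot(x, y) / AU:.3f} AU")
```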
One distinction that is helpful in thinking about this question is fluid intelligence versus crystallized intelligence. Fluid intelligence is the ability to reason, think abstractly, and solve novel problems independently of acquired knowledge; crystallized intelligence is more stored “book learning”, such as vocabulary or general knowledge. LLMs are high on crystallized intelligence, but remain low on fluid intelligence. A benchmark that tests this is Francois Chollet’s ARC (Abstraction and Reasoning Corpus), which aims to test fluid intelligence through reasoning tasks that are deliberately kept out of AI training data, and which LLMs continue to do poorly at. [5]
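For readers who have not seen ARC, each task shows a handful of input/output grid pairs and asks the solver to infer the transformation and apply it to a new input. The toy task below is only in the spirit of ARC (it is not drawn from the actual corpus): the hidden rule is a left-right mirror, trivial for a human to spot from two examples but something that has to be inferred rather than retrieved:

```python
# A toy task in the spirit of ARC: infer the rule from the examples, apply it to the test input.
train_pairs = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
]
test_input = [[5, 0, 0],
              [0, 0, 6]]

# The hidden rule here is "mirror each row left to right"; a solver has to discover
# that from the two examples alone, with no similar task in its training data.
def mirror(grid):
    return [list(reversed(row)) for row in grid]

assert all(mirror(x) == y for x, y in train_pairs)
print(mirror(test_input))  # [[0, 0, 5], [6, 0, 0]]
```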
We should keep in mind the striking definition of ‘true’ AI given by Hernandez-Orallo, paraphrasing McCarthy: “AI is the science and engineering of making machines do tasks they have never seen and have not been prepared for beforehand.” [6] LLMs may change the world greatly, but perhaps by less than the hype implies.
Rounding up
With LLMs, we have clearly discovered a critical ingredient of intelligence. But many of the considerations above suggest that AGI-like architectures will contain several parts, with LLMs playing one role in a broader whole. This would match how the human brain works: there are many distinct modules (the cerebellum, the thalamus, and so on) that each specialize in different tasks, but with major overlaps. AGI-like systems may be similar.
One might also expect a rich line of research on teams of LLMs coordinating on outputs, which can outperform individual LLMs. LLM “wisdom of the crowds” forecasts, for example, outperform single LLMs, and such techniques exploit the fact that multiple LLMs can be queried cheaply to improve results. [7]
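A minimal version of the ensembling idea in [7], assuming a hypothetical query_llm helper that returns a probability estimate for a forecasting question: ask several models (or several samples of one model), then aggregate, since the median of many noisy estimates tends to beat most of the individual ones:

```python
import statistics

def query_llm(model: str, question: str) -> float:
    """Hypothetical helper: ask `model` for a probability in [0, 1] for `question`."""
    raise NotImplementedError

def crowd_forecast(question: str, models: list[str]) -> float:
    """Aggregate independent LLM forecasts; the median damps individual outliers."""
    estimates = [query_llm(m, question) for m in models]
    return statistics.median(estimates)

models = ["model-a", "model-b", "model-c"]  # placeholders for whichever LLMs are cheaply available
print(crowd_forecast("What is the probability that X happens by the end of 2026?", models))
```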
Finally, none of these arguments are definitive; we simply do not know how things will go. Investment and research by tech companies and governments will grow rapidly this decade, and the arms race is already underway; kicking off this arms race may turn out to be ChatGPT’s greatest legacy. This suggests we should prepare for a world in which we do get superhuman AI by the end of the decade – even if the chances are low.
[2] https://ai.meta.com/blog/openeqa-embodied-question-answering-robotics-ar-glasses/
[4] https://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/
[5] Francois Chollet, On the Measure of Intelligence (2019) https://doi.org/10.48550/arXiv.1911.01547
[6] Jose Hernandez-Orallo. Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement. Artificial Intelligence Review, pages 397–447, 2017.
[7] Philipp Schoenegger et al., Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy (2024). https://arxiv.org/abs/2402.19379