Can LLMs lead us to AGI?

The Limits of LLM Based Systems

December 20, 2025

Progress in LLMs became most noticeable in the period from late 2024 into early 2025, with the appearance of the first agentic systems. Earlier generations of LLM-based chatbots (from 2022 and before) still responded directly with text output to a user's input, relying primarily on the scale of the model itself for improved accuracy and quality.

As impressive as recent progress has been, critics - most prominently AI pioneer Yann LeCun - argue that only using LLMs (as we understood them around 2022) might not be sufficient to get to AGI. They point to a fundamental limitation: LLMs are ultimately designed to predict the next most likely token in a sequence. This implies that as they generate longer outputs — whether reasoning through a problem or writing extended text — they become exponentially more likely to drift into incoherence or accumulate errors. Each prediction is based on statistical patterns rather than genuine understanding, and small mistakes can compound as the model continues generating more output.

Yet despite these predicted limitations, the capabilities of the LLM-based agents of 2025 have improved astonishingly over the chatbots of 2022, and progress does not yet seem to be slowing down.

This does sound like a paradox: we see remarkable capabilities emerging from scaled-up and optimised LLMs, yet they appear to have inherent architectural limitations. Are these systems inherently limited? Or can they be scaled into generally intelligent agents? Can both perspectives be true simultaneously? And if so, what should we expect from these systems?

Understanding whether the current trajectory leads to artificial general intelligence or hits fundamental limits isn't just an academic question. It shapes how we should invest in AI research, how and where we should deploy these systems within society, and how best to manage expectations about what AI can and cannot achieve.

To better understand the arguments why LLM agent scalability might be limited to relatively simple tasks, we must examine in a bit more detail how these systems actually work. At their core, modern LLM agents are large language models that generate text one token at a time. Each new "token" is predicted based on all previous tokens in the sequence. The term "token" here refers to subword units: a single word might be split into multiple tokens like "un", "predict", "able". This sequential generation process is called "autoregressive"—each token depends on all previous tokens in an ever-growing chain.
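The autoregressive loop described above can be sketched with a toy stand-in for the model. The bigram table below is a hypothetical illustration of "given the previous tokens, pick the most likely next one", not how a real LLM actually predicts:

```python
import random

# Toy illustration of autoregressive generation: each new token is chosen
# based on the sequence generated so far, and the chain grows one token
# at a time. The bigram table is a made-up stand-in for a trained model.
BIGRAMS = {
    "un": ["predict"],
    "predict": ["able"],
    "able": ["<eos>"],  # end-of-sequence marker
}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Look up plausible continuations of the last token; a real model
        # would condition on the entire prefix and sample from a softmax.
        candidates = BIGRAMS.get(tokens[-1], ["<eos>"])
        next_token = random.choice(candidates)
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # later steps depend on everything so far
    return tokens

print(generate(["un"]))  # the subword tokens join into "unpredictable"
```

Note how each appended token becomes part of the context for the next prediction; this ever-growing dependency is what "autoregressive" refers to.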

The LLM agent architecture builds on this foundation with several key enhancements:

Extended Context Windows: Modern models can attend to far more prior text than earlier generations. A larger context window reduces drift and hallucination (though it doesn't reduce them to zero), and allows these models to stay coherent across multiple documents or files in a codebase. Nevertheless, as the documents or codebase grow, the likelihood of the LLM agent starting to drift increases again.

Chain-of-Thought and "Thinking Tokens": These longer context windows have enabled modern agents to use internal reasoning steps. Instead of predicting the output tokens as a final answer directly, modern models run intermediate steps to work through the problem step by step. These internal monologues are not shown to users, but they help keep the model on track and guide it toward more coherent outputs.

This mechanism creates a form of iterative self-refinement. The agent generates internal reasoning steps, uses them as context to generate more refined reasoning steps, and eventually produces an answer. The same process can be applied at the level of actions: the agent generates a plan, executes part of it, observes the results, generates a revised plan, and continues iterating.
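The plan-execute-observe-revise loop might be sketched as follows. The llm() function is a hypothetical placeholder for a real model call, and the execution step is simulated; no real tools are invoked:

```python
# Sketch of the iterative agent loop described above: each new plan is
# conditioned on the task plus the accumulated history of earlier plans
# and their observed results.

def llm(prompt: str) -> str:
    # Placeholder: a real call would return model-generated text.
    return "refined plan"

def agent_loop(task: str, max_iterations: int = 3):
    history = []  # accumulated context: (plan, observation) pairs
    for i in range(max_iterations):
        # The prompt grows with every iteration, feeding results back in.
        plan = llm(f"Task: {task}\nHistory: {history}\nNext step?")
        observation = f"result {i}"  # stand-in for executing part of the plan
        history.append((plan, observation))
    return history

print(agent_loop("create a music recommendation app"))
```

A real agent would also ask the model whether the task is complete instead of running a fixed number of iterations; the fixed loop here just keeps the sketch deterministic.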

Does This Hit a Limit? Despite all of the above improvements, each reasoning step, each next token, is still generated through next-token prediction. The model is always asking: "Given everything that came before, what token is most likely to come next?" It does not truly "understand" in the way humans do; it pattern-matches against the vast corpus of text it was trained on. As a consequence, there will always be a chance, however small, that the next token is wrong.

Yann LeCun, among others, points to the following fundamental limitation in the design of LLM agents: if e is the probability that any single token takes us outside the space of correct answers, then the probability that a response of length n remains correct is P = (1 - e)^n. This probability decays exponentially as we generate more tokens, meaning longer outputs become exponentially more likely to drift into incoherence and error rather than construct structured, correct answers to a problem.
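The formula can be checked numerically. A minimal sketch, where the per-token error rate e = 0.001 is an illustrative assumption rather than a measured value:

```python
# The compounding-error argument in numbers: if each token independently
# has probability e of leaving the space of correct answers, a length-n
# output remains correct with probability P = (1 - e) ** n.

def p_correct(e: float, n: int) -> float:
    return (1 - e) ** n

# Even a tiny per-token error rate erodes quickly with output length:
for n in (10, 100, 1000, 10000):
    print(f"n={n:>5}  P={p_correct(0.001, n):.4f}")
```

With e = 0.001, a 10-token answer is almost certainly fine, but by a few thousand tokens the probability of a fully correct output has collapsed toward zero.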

This means that even when we use methods like thinking tokens to better structure the output of these LLM agents and suppress their hallucinations, it still becomes exponentially more likely that the models drift into incoherence and error as they think for longer. As we let an LLM agent perform more internal reasoning steps, in the hope that this internal reasoning will better structure its final output, small errors may still compound. This, so the critics say, implies that any complex task performed by an LLM agent will contain intermediate steps that are inherently flawed, with inconsistencies and errors, and that LLM agents are therefore fundamentally doomed to fail at any sufficiently complex task.

This exponential error accumulation problem manifests in observable ways. When you give an agent a vague instruction like "create a music recommendation app," there are countless valid architectural choices: which framework, which database, which API structure, how to organize files, what naming conventions to use. The agent picks a plausible path and starts generating. But each decision narrows the space of coherent next decisions. If the agent makes a slightly suboptimal choice in step 3, it must pick up on that and rewrite it during step 4, or else work around it, which constrains step 5, and so on.

The problem arises particularly when the agent must choose between multiple valid but different approaches, or when it needs to maintain a non-standard pattern established earlier in the generation. Here, the "most likely next token" heuristic breaks down. The model tends to regress toward the mean of its training distribution, toward the most common patterns it has seen, even when the specific context calls for something different.

Not all token sequences are equally likely to introduce errors. Well-established patterns like common code patterns or standard technical jargon may be easier to continue correctly. When an agent operates within these well-defined paths, the error probability e for each token is very low, and (1 - e)^n remains acceptably high even for large n.
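A quick numerical comparison of the two regimes described above, using illustrative (not measured) per-token error rates:

```python
# Comparing the same formula P = (1 - e) ** n under two assumed regimes:
# a well-trodden pattern (tiny e) versus a novel pattern that fights the
# training distribution (larger e). Both e values are assumptions chosen
# purely to illustrate the contrast.

def p_correct(e: float, n: int) -> float:
    return (1 - e) ** n

n = 5000  # tokens in a moderately long generation
print(f"boilerplate-like (e=1e-6): P = {p_correct(1e-6, n):.4f}")
print(f"novel pattern    (e=1e-3): P = {p_correct(1e-3, n):.4f}")
```

For the same output length, the low-e path stays near certainty while the high-e path has almost no chance of remaining fully correct, which is the asymmetry the paragraph above describes.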

When performing standard tasks, error rates are vanishingly small, and we can get these LLM agents to do reliable work for us. But when maintaining a novel architectural pattern across dozens or hundreds of files, or reasoning through an unusual edge case, the model has to continuously work against its training distribution's gravitational pull toward common patterns. This is why LLM agents are excellent at generating boilerplate, perform reasonably when implementing well-defined specifications, but struggle to maintain coherent novel architectures or reason through new problems.

Consequently, it looks like these LLM agents might become extremely efficient at performing well-defined routine tasks. Yet, at the same time, they might indeed be severely limited in their capacity to break free from established patterns and generate truly novel ideas, produce new knowledge, or work on novel research projects.

References

  1. Wired (2026). The Math on AI Agents Doesn’t Add Up.