Reasoning
Progress in LLMs
January 3, 2026
As the quality of model output improved with larger models, and as growing context windows allowed output to stay coherent for longer, it was observed as early as 2022 [Wei et al., Google, 2022] that adding simple instructions to the input, such as "Let's think step by step", could significantly improve the accuracy and reasoning performance of output generated by LLM-based chatbots.
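Concretely, zero-shot chain-of-thought prompting of this kind is nothing more than prompt construction. A minimal sketch (the function name here is illustrative, not from any particular library):

```python
def add_cot_cue(question: str) -> str:
    # Zero-shot CoT: append a simple reasoning cue to the user's
    # question before sending the combined prompt to the model.
    return question + "\nLet's think step by step."

prompt = add_cot_cue("How many r's are in the word 'strawberry'?")
print(prompt)
```

The model receives the cue as part of its input and, if large enough, responds with intermediate reasoning before the answer.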
The famous 'strawberry test' illustrates the idea behind this:
When asked a basic counting question such as "How many r's are in the word 'strawberry'?", earlier models behaved like typical next-token predictors and would often hallucinate:

GPT-4 (single-turn answer): "The word 'strawberry' has two 'r's."

In contrast, by letting the model first break the problem down into internal reasoning steps (which don't have to be shown to the user), the final output becomes more accurate:

GPT-5 (reasoning + final answer):
Reasoning output: Counting letters in "strawberry"… The letters are: s, t, r, a, w, b, e, r, r, y. I find 'r' in positions 3, 8, and 9, so there are 3 occurrences. The user asked for the count, which is 3.
Final output: Three.
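The arithmetic that the reasoning trace performs can be checked directly in a few lines of Python (positions are 1-indexed, matching the trace):

```python
word = "strawberry"
# 1-indexed positions of every 'r' in the word
positions = [i for i, ch in enumerate(word, start=1) if ch == "r"]
print(positions)       # [3, 8, 9]
print(len(positions))  # 3
```

This is exactly the decomposition the reasoning step carries out: enumerate the letters, locate each 'r', then count.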
Research found that models smaller than approximately 100 billion parameters not only fail to benefit from CoT prompting but often perform worse than with standard prompting, producing fluent but illogical or incoherent chains of thought. Reasoning thus appears to emerge as an ability at approximately 100 billion parameters, with consistent performance gains observed only in very large models ("Language Models Perform Reasoning via Chain of Thought", Google, 2022).
By employing multiple internal reasoning steps, generating intermediate thoughts, feeding the results back into the input, and iterating, an LLM can create a form of internal dialogue that enables it to revise and self-correct before committing to a final output. Anthropic's Claude 3.5 Sonnet (June 2024) was one of the first to implement this 'thinking mode' for its paying users, followed by OpenAI's o1-preview (September 2024) and Google's Gemini 2.0 (December 2024). Though more expensive to run, this quickly became the new standard across all labs (also for free demo users) after DeepSeek's R1 (January 2025) attracted a lot of positive media attention.
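The iterate-and-feed-back loop described above can be sketched in a few lines. This is a toy illustration, not any lab's actual implementation: `model` stands in for a real LLM call, and the `FINAL:` stop marker is an assumption of this sketch.

```python
def reasoning_loop(model, question, max_steps=8, stop_marker="FINAL:"):
    """Feed the model's intermediate thoughts back into its own input,
    iterating until it commits to a final answer."""
    transcript = question
    for _ in range(max_steps):
        thought = model(transcript)   # one internal reasoning step
        transcript += "\n" + thought  # feed the thought back in
        if thought.startswith(stop_marker):
            return thought[len(stop_marker):].strip()
    return None  # the model never committed to an answer

# Toy stand-in model: emits two scripted thoughts, then a final answer.
script = iter(["Spelling the word out: s, t, r, a, w, b, e, r, r, y.",
               "I see 'r' at positions 3, 8 and 9, so 3 occurrences.",
               "FINAL: Three."])
answer = reasoning_loop(lambda _: next(script),
                        "How many r's are in 'strawberry'?")
print(answer)  # Three.
```

Because each thought is appended to the transcript before the next step, later steps can inspect and revise earlier ones, which is what allows the self-correction described above.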
