Context Windows

Progress in LLMs

January 3, 2026

When an LLM predicts the next word in a sequence, it relies on a mechanism called attention – the ability to focus on relevant parts of the input when generating each token. When you ask an LLM a question, its attention mechanism weighs which previous words are most important for determining what comes next.

For instance, if you write "The cat sat on the mat, and then it...", the attention mechanism helps the model understand that "it" likely refers back to "the cat" rather than "the mat".
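This weighting idea can be sketched in a few lines of NumPy. Note this is a simplified illustration with random vectors standing in for learned token representations; real models derive Q, K, and V through learned projection matrices.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: score every query against every key,
    # turn the scores into weights with softmax, then mix the values.
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

# Toy example: 7 tokens with 8-dimensional embeddings (random stand-ins).
rng = np.random.default_rng(0)
X = rng.standard_normal((7, 8))
out, w = attention(X, X, X)  # self-attention: Q = K = V come from the same input
```

Each row of `w` is the attention distribution for one token: how strongly that token "looks at" every other token when producing its output.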

The context window is the maximum amount of text an LLM can "pay attention to" at once. The model can only "see" and consider what fits within its context window when generating responses. Importantly, the context window is determined by architectural design choices — not by the number of model parameters. A 7-billion-parameter model could theoretically have a longer context window than a 70-billion-parameter model, depending on how each is designed.

Creating models with larger context windows has enabled them to maintain coherence across entire documents instead of short text snippets. Older models like GPT-2 (2019) had only a 1024-token context window and would lose track of topics, hallucinate unrelated content, and start to drift after reading or generating less than a single page of text.

To understand why scaling the context window was so computationally heavy, we must better understand how LLMs process sequences of text. Modern transformer-based LLMs rely on the self-attention mechanism, introduced in the seminal 2017 paper "Attention Is All You Need" (Vaswani et al.).

In self-attention, the model computes a pairwise attention score between every token and every other token. For a text sequence of n tokens, this results in $n \times n$ attention pairs that need to be calculated. So for a sequence of 100,000 tokens, this means computing 10 billion pairwise attention scores.
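The blow-up is easy to verify by counting, and it shows up directly as the shape of the score matrix (a small sketch; real models would never materialize this at 100,000 tokens):

```python
import numpy as np

def num_attention_pairs(n):
    # Full self-attention scores every token against every other token.
    return n * n

assert num_attention_pairs(100_000) == 10_000_000_000  # 10 billion pairs

# The same count appears as the shape of the pairwise score matrix:
n, d = 512, 64
X = np.random.randn(n, d)
scores = X @ X.T / np.sqrt(d)  # one score per (token, token) pair
assert scores.shape == (n, n)  # 512 * 512 = 262,144 pairs
```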

As a consequence, doubling the context length means computing 4x as many attention pairs, which requires roughly 4x the memory to store everything and 4x the compute time. This quadratic scaling becomes increasingly expensive as the sequence length grows.

Allowing LLMs to stay coherent and process more input text meant extending their context windows. But achieving this required overcoming the transformer's fundamental O(n²) scaling problem. Several directions of research contributed to this.

Sparse and Approximate Attention: One approach was, instead of calculating all $n \times n$ attention pairs (sometimes referred to as 'full' or 'dense' attention), to compute attention for only some token pairs. The idea is that to predict the next word in a sentence like "The cat sat on the mat, and then it...", we do not need the attention between every single pair. Several of the approaches explored were combinations of ideas like the following: local (sliding-window) attention, where each token attends only to its nearest neighbours; strided attention, where each token also attends to tokens at regular intervals across the sequence; and a small set of global tokens that attend to, and are attended by, every position.
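One of these sparse patterns, sliding-window (local) attention, can be sketched as a boolean mask over the pair grid. The window size here is an arbitrary illustrative choice:

```python
import numpy as np

def sliding_window_mask(n, window):
    # True where a pair is actually scored: each token attends only to
    # tokens at most `window` positions away, instead of to all n tokens.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(4096, 128)
computed = mask.sum()  # pairs actually scored
total = mask.size      # n * n pairs under full dense attention
# Only ~6% of the pairs are computed; cost now grows as O(n * window)
# rather than O(n^2).
```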

Innovations like these allowed later generations of LLMs to work with much larger context windows. GPT-3 (2020) already used a combination of 'full dense' attention and sparse attention, which allowed it to extend its context window from 1024 to 2048 tokens. But larger context windows do not mean denser attention: as research went on, we found out how to grow context windows in size precisely by letting attention become increasingly sparse, using combinations of the above methods while preserving as much information from the original text as possible.

Further optimisations pushed this even further. Techniques like FlashAttention, which computes exact attention in small on-chip blocks instead of materializing the full score matrix, allowed GPT-4 (2023) to offer a context window of up to 128k tokens!
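FlashAttention itself is a carefully tuned GPU kernel, but the numerical trick it relies on, the online softmax, can be sketched in a few lines. This is a simplified single-query version for illustration, not the actual implementation: the keys are walked block by block, and the full 1-by-n score row is never stored.

```python
import numpy as np

def streaming_attention(q, K, V, block=64):
    # Online-softmax attention for one query vector q against n keys/values.
    # Keeps a running max (m), running normalizer (s) and running weighted
    # sum (acc), rescaling earlier blocks whenever a new max is found.
    d = q.shape[0]
    m, s = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for i in range(0, K.shape[0], block):
        scores = K[i:i + block] @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)   # rescale contributions seen so far
        p = np.exp(scores - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ V[i:i + block]
        m = m_new
    return acc / s

# Sanity check: matches ordinary full-matrix softmax attention.
rng = np.random.default_rng(1)
q = rng.standard_normal(16)
K = rng.standard_normal((1000, 16))
V = rng.standard_normal((1000, 32))
row = K @ q / np.sqrt(16)
w = np.exp(row - row.max())
full = (w / w.sum()) @ V
assert np.allclose(streaming_attention(q, K, V), full)
```

The memory needed per query drops from O(n) for the score row to O(block), which is what makes attention over very long sequences fit in fast on-chip memory.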

As research and optimisations went on, modern LLMs like Gemini 3 reached context windows of over 1 million tokens, equivalent to several hundred pages of text. This increased context window reduces the amount of drift and hallucination (though it doesn't reduce it to zero), and allows these models to stay coherent across entire long documents or files in a codebase.

Back in 2018, it looked like this 1,000x scaling of the context window might have required a 1,000,000x scaling in memory and computation, but in less than 5 years, additional optimisations and research made it practical to run.

The shift from 2k to 2M tokens completely transformed what's possible: developers can now process entire codebases (enabling agentic coding assistants), analyze 300+ page documents in a single pass, and maintain coherent conversations across hundreds of chat responses.

Continue reading: Reasoning