Context Windows
Progress in LLMs
January 3, 2026
When an LLM predicts the next word in a sequence, it relies on a mechanism called attention – the ability to focus on relevant parts of the input when generating each token. When you ask an LLM a question, its attention mechanism weighs which previous words are most important for determining what comes next.
For instance, if you write "The cat sat on the mat, and then it...", the attention mechanism helps the model understand that "it" likely refers back to "the cat" rather than "the mat".
The context window is the maximum amount of text an LLM can "pay attention to" at once. The model can only "see" and consider what fits within its context window when generating responses. Importantly, the context window is determined by architectural design choices — not by the number of model parameters. A 7-billion-parameter model could theoretically have a longer context window than a 70-billion-parameter model, depending on how each is designed.
Creating models with larger context windows has enabled them to maintain coherence across entire documents instead of short text snippets. Older models like GPT-2 (2019) had only a 1024-token context window and would lose track of topics, hallucinate unrelated content, and start to drift after reading or generating less than a single page of text.
To understand why scaling the context window was so computationally expensive, we must better understand how LLMs process sequences of text. Modern transformer-based LLMs rely on the self-attention mechanism, introduced in the seminal 2017 paper "Attention Is All You Need" (Vaswani et al.).
In self-attention, the model computes a pairwise attention score between every token and every other token. For a text sequence of n tokens, this results in $n \times n$ attention pairs that need to be calculated. So for a sequence of 100,000 tokens, this means computing 10 billion pairwise attention scores.
As a consequence, when you double the context length, you must compute 4x as many attention pairs, which requires 4x the memory to store them and roughly 4x the time to compute them. This quadratic scaling becomes increasingly expensive as the sequence length grows.
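The quadratic cost is easy to make concrete. A minimal NumPy sketch (the function name and dimensions here are illustrative, not from any real model) shows that the score matrix for $n$ tokens has $n \times n$ entries, so doubling $n$ quadruples it:

```python
import numpy as np

def attention_scores(n, d=64, seed=0):
    """Toy self-attention scores: every token attends to every other token.

    Illustrative sketch only -- random Q/K vectors stand in for the
    learned query/key projections of a real transformer.
    """
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n, d))   # one query vector per token
    K = rng.standard_normal((n, d))   # one key vector per token
    return Q @ K.T / np.sqrt(d)       # n x n pairwise attention scores

small = attention_scores(1024)
large = attention_scores(2048)            # double the context length...
assert small.shape == (1024, 1024)
assert large.size == 4 * small.size       # ...4x the attention pairs
```

At 100,000 tokens the same matrix would hold 10 billion scores, which is why dense attention stops being practical long before that point.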
Allowing LLMs to stay coherent while processing more input text meant extending their context windows. But to achieve this, the transformer's fundamental O(n²) scaling problem had to be overcome. Several directions of research contributed to solving it.
Sparse and Approximate Attention: One approach was, instead of calculating all $n \times n$ attention pairs (sometimes referred to as 'full' or 'dense' attention), to compute attention for only a subset of token pairs. The idea is that to predict the next word in a sentence like "The cat sat on the mat, and then it...", we do not need the attention score between every single pair. Several explored approaches combined ideas like the following:
- Local windowed attention: Each token attends only to nearby neighbors (for example, 512 tokens on each side). This captures local syntactic and semantic relationships efficiently, without calculating the attention between every pair of tokens in the full context window.
- Dilated sliding windows: By using 'skip patterns' (attending to every 2nd, 4th, 8th token, etc.), the model can still capture longer-range dependencies without computing full attention.
- Global attention: By designating certain special tokens that attend to all other tokens (and are attended to by all tokens), some information flow is preserved across the entire sequence.
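The patterns above can be combined into a single boolean attention mask. Here is a hypothetical NumPy sketch (window size, dilation, and the choice of global token are arbitrary illustration values, not taken from any published model) showing how much sparser the result is than dense attention:

```python
import numpy as np

def sparse_mask(n, window=2, dilation=4, global_tokens=(0,)):
    """Boolean mask: entry [i, j] is True if token i may attend to token j.

    Combines a local sliding window, a dilated 'skip' pattern, and a few
    global tokens. Parameter values here are purely illustrative.
    """
    i, j = np.indices((n, n))
    dist = np.abs(i - j)
    local = dist <= window                                  # nearby neighbors
    dilated = (dist % dilation == 0) & (dist <= window * dilation)  # skip pattern
    mask = local | dilated
    for g in global_tokens:          # global tokens see, and are seen by, everyone
        mask[g, :] = True
        mask[:, g] = True
    return mask

m = sparse_mask(256)
density = m.sum() / m.size
assert density < 0.2    # far fewer pairs than the dense n*n attention
```

Only the pairs where the mask is True need their attention scores computed, which is where the savings over dense attention come from.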
Innovations like these allowed later generations of LLMs to work with much larger context windows. GPT-3 (2020) already used a combination of full dense attention and sparse attention, which allowed it to extend its context window from 1024 to 2048 tokens. But a larger context window doesn't mean denser attention: as research went on, we learned how to grow context windows by letting the attention become increasingly sparse, using combinations of the methods above while preserving as much information from the original text as possible.
Additional optimisations made it possible to push this even further.
- Locality-Sensitive Hashing (LSH): Typically, only the attention from a handful of tokens actually matters when predicting the next word; most attention weights end up near zero and contribute almost nothing. LSH hashes similar token representations into the same bucket, so each token computes attention only within its bucket. The many token pairs whose contribution would be negligible are never computed at all, speeding up the calculation.
- FlashAttention (Dao et al., 2022) and FlashAttention-2 (2023): Rather than reducing computational complexity, FlashAttention optimizes memory access patterns by leveraging the GPU memory hierarchy. By "tiling" the attention computation—loading blocks of Q, K, V from high-bandwidth memory (HBM) into fast on-chip SRAM, computing attention for those blocks, and writing back to HBM—it reduces memory I/O from quadratic to linear in sequence length. This achieves a 2-9× speedup over standard implementations and enables processing of much longer sequences within hardware memory constraints. FlashAttention-2 reaches up to 230 TFLOPs/s on A100 GPUs (nearly 70% of theoretical maximum).
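FlashAttention itself is a fused GPU kernel, but the tiling idea can be sketched in plain NumPy using the "online softmax" trick it relies on: process K and V one block at a time, keeping running softmax statistics, so only a block-sized score tile ever exists in memory. This is an illustrative sketch (function names and the block size are mine; real kernels keep the tiles in SRAM), not the actual implementation:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full n x n score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=64):
    """FlashAttention-style tiling sketch using the online softmax trick.

    Only an n x block score tile exists at any moment; running max (m)
    and running denominator (l) are rescaled as new blocks arrive.
    """
    n, d = Q.shape
    out = np.zeros_like(Q, dtype=float)
    m = np.full(n, -np.inf)               # running row-max of scores
    l = np.zeros(n)                       # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)         # n x block tile only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)         # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]
```

Both functions compute the same result; the tiled version just never holds the quadratic score matrix, which is exactly the memory saving that lets long sequences fit on hardware.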
Optimisation techniques like FlashAttention helped GPT-4 (2023) reach a context window of up to 128k tokens!
As research and optimisations continued, modern LLMs like Gemini 3 have context windows of over 1 million tokens, equivalent to several hundred pages of text. This increased context window reduces the amount of drift and hallucinations (though it doesn't reduce it to zero), and allows these models to stay coherent across entire long documents or files in a codebase.
Back in 2018, it looked like this 1,000x scaling of the context window might require a 1,000,000x scaling in memory and computation, but in less than five years, additional optimisations and research made it practical.
The shift from 2k to 2M tokens completely transformed what's possible: developers can now process entire codebases (enabling agentic coding assistants), analyze 300+ page documents in a single pass, and maintain coherent conversations across hundreds of chat responses.
