Transformers

Nearly Orthogonal Vectors

February 19, 2025

In the previous section we saw that encoders can map each word to a point in a high-dimensional space, and that the geometry of that space can capture semantic relationships. A natural follow-up question is: how many dimensions do we actually need?

You might expect the answer to be: "at least as many dimensions as there are concepts you want to represent." But in practice, word embeddings work extremely well with just 300 to 1000 dimensions. How is that possible?

Orthogonal Vectors

Two vectors are orthogonal if they point in completely independent directions, at exactly a 90-degree angle to each other. Equivalently, their dot product is zero.
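As a small illustration of that definition, here is a minimal sketch that measures the angle between two vectors from their dot product. The helper names `dot` and `angle_degrees` and the example vectors are made up for this illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def angle_degrees(u, v):
    """Angle between two nonzero vectors, in degrees."""
    cos = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

x_axis = [1.0, 0.0, 0.0]
y_axis = [0.0, 1.0, 0.0]
diagonal = [1.0, 1.0, 0.0]

# Orthogonal pair: the dot product is 0 and the angle is 90 degrees.
print(dot(x_axis, y_axis), angle_degrees(x_axis, y_axis))
# Non-orthogonal pair: the dot product is nonzero and the angle is about 45 degrees.
print(dot(x_axis, diagonal), angle_degrees(x_axis, diagonal))
```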

In terms of a latent space, orthogonal embeddings are ideal: if two words have orthogonal representations, they are completely independent, and knowing something about one tells you nothing about the other.

If you have a vocabulary of 100,000 words, you'd need a 100,000-dimensional embedding to give each word its own independent direction.

But working with vectors that large would require vast computational resources.

Fortunately, not every word corresponds to an independent concept.

A group of words like 'building', 'house', 'apartment', and 'flat' have similar meanings. Other groups of words like 'kid' and 'child' are even more similar. So we might be able to capture all of these concepts with fewer dimensions than there are words.

This would reduce the total number of required dimensions somewhat, but we would probably still need 10,000 dimensions or more.

Yet, as noted above, embeddings with just 300 to 1000 dimensions work extremely well in practice.

The answer lies in a beautiful property of high-dimensional geometry: nearly orthogonal vectors.

Nearly Orthogonal Vectors

In $d$ dimensions, you can have at most $d$ mutually orthogonal vectors. In 3D space, that's just three: the x, y, and z axes. So if we needed each word to be exactly orthogonal to every other word, a 300-dimensional embedding could only handle 300 words. That is clearly not what's happening.
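One way to see why 3D space has no room for a fourth orthogonal direction: take any candidate vector and subtract its components along the x, y, and z axes. Nothing is left over. The sketch below does exactly that. The function name `residual_after_projection` and the candidate vector are made up for illustration.

```python
def residual_after_projection(v, basis):
    """Subtract from v its component along each vector of an orthonormal basis."""
    residual = list(v)
    for b in basis:
        coeff = sum(x * y for x, y in zip(v, b))  # component of v along b
        residual = [r - coeff * bi for r, bi in zip(residual, b)]
    return residual

axes = [
    [1.0, 0.0, 0.0],  # x
    [0.0, 1.0, 0.0],  # y
    [0.0, 0.0, 1.0],  # z
]
candidate = [0.3, -1.2, 2.5]

# Whatever candidate we pick, the part orthogonal to all three axes is the zero vector.
leftover = residual_after_projection(candidate, axes)
print(leftover)
```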

The key insight is that we don't need exact orthogonality. We just need vectors to be nearly orthogonal, meaning that the dot product of two unit-length vectors is very close to zero. If two word vectors are nearly orthogonal, they barely "interfere" with each other, and can still be treated as effectively independent.

In low dimensions, relaxing the requirement from exact to near orthogonality doesn't gain you much. In 2D, 3D, 4D, or 5D, whether you require exactly 90° or allow 85°, you can still only fit a handful of vectors before they start pointing in similar directions.
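To make "very close to zero" concrete, here is a minimal sketch of a near-orthogonality check based on the cosine of the angle between two vectors. The tolerance of 0.1 is an arbitrary choice for this illustration, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_nearly_orthogonal(u, v, tolerance=0.1):
    """True if |cos(angle)| is within the tolerance (0.1 is an assumed threshold)."""
    return abs(cosine(u, v)) <= tolerance

reference = [1.0, 0.0]
at_85_degrees = [math.cos(math.radians(85)), math.sin(math.radians(85))]
at_60_degrees = [math.cos(math.radians(60)), math.sin(math.radians(60))]

print(cosine(reference, at_85_degrees))                # about 0.087
print(is_nearly_orthogonal(reference, at_85_degrees))  # True
print(is_nearly_orthogonal(reference, at_60_degrees))  # False (cosine is about 0.5)
```

An angle of 85° corresponds to a cosine of only about 0.087, so the two vectors overlap by less than a tenth of their length.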

But in high dimensions, something remarkable happens.

As the number of dimensions grows, the number of nearly orthogonal vectors you can pack into that space grows exponentially.

Here is the intuition. If you pick a random unit vector in a thousand-dimensional space, it is almost certainly nearly orthogonal to any other vector you've already chosen, just by chance. There is so much room in high dimensions that nearly all random directions end up nearly perpendicular to each other.
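This intuition is easy to test empirically. The sketch below samples 50 random unit vectors in 3, 30, and 1000 dimensions and reports the worst overlap (the largest absolute cosine) over all pairs. The vector count, the seed, and the function names are arbitrary choices for illustration. It uses only the standard library, so it is slow for much larger counts.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a uniformly random direction by normalising Gaussian coordinates."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(dim, count, rng):
    """Largest |cosine| over all pairs among `count` random unit vectors."""
    vectors = [random_unit_vector(dim, rng) for _ in range(count)]
    worst = 0.0
    for i in range(count):
        for j in range(i + 1, count):
            overlap = abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            worst = max(worst, overlap)
    return min(worst, 1.0)  # guard against floating-point overshoot

rng = random.Random(0)
results = {dim: max_abs_cosine(dim, 50, rng) for dim in (3, 30, 1000)}

for dim, worst in results.items():
    # How far the worst pair deviates from a perfect 90-degree angle.
    deviation = 90.0 - math.degrees(math.acos(worst))
    print(f"d={dim:5d}  worst |cos| = {worst:.3f}  deviation from 90 deg = {deviation:.1f}")
```

In 3 dimensions some pair ends up almost parallel, while in 1000 dimensions even the worst of the 1,225 pairs stays within a few degrees of perpendicular.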

More precisely: in a space with $d$ dimensions, you can fit roughly $e^{\alpha \cdot d}$ unit vectors such that every pair has a dot product close to zero, where $\alpha$ is a small positive constant that depends on how close to zero the dot products must be. That $e^{\alpha \cdot d}$ is exponential in $d$.

A jump from 300 to 1000 dimensions does not just give you 700 more directions (to store 700 additional concepts). It gives you exponentially more capacity.
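To get a feel for what exponential growth means here, the sketch below plugs numbers into one conservative version of this bound. The constant $\alpha = \varepsilon^2 / 4$ is an assumption: it comes from a standard concentration-plus-union-bound argument for random unit vectors, not from this article. The parameter `eps` (the largest allowed absolute dot product) and the function name are also illustrative. The result is a guaranteed minimum, and the true maximum number of nearly orthogonal vectors is larger.

```python
import math

def guaranteed_count(dim, eps):
    """Conservative lower bound exp(dim * eps**2 / 4) on how many unit vectors
    can have all pairwise |dot products| below eps.

    Assumption: alpha = eps**2 / 4, from a spherical-cap bound plus a union bound.
    """
    return math.exp(dim * eps ** 2 / 4)

for dim in (3, 300, 1000):
    print(f"d={dim:5d}  guaranteed count >= {guaranteed_count(dim, eps=0.3):.3g}")

# Going from 300 to 1000 dimensions multiplies the bound instead of adding to it.
growth_factor = guaranteed_count(1000, 0.3) / guaranteed_count(300, 0.3)
print(f"growth factor from d=300 to d=1000: {growth_factor:.3g}")
```

With `eps=0.3`, the bound goes from under a thousand vectors at 300 dimensions to several billion at 1000 dimensions.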

What This Means for Language Models

This explains something that would otherwise be puzzling: an encoder like Word2Vec can map millions of words perfectly well onto a 300-dimensional embedding. The network does not need a separate, fully independent dimension for every word. It can store far more words than it has dimensions in slightly overlapping, nearly orthogonal directions. Each semantic dimension (gender, geography, tense, sentiment, domain) occupies approximately one direction in this space.

This principle extends even further. Research on the superposition hypothesis suggests that trained neural networks use this same principle to represent far more features than they have neurons. Rather than dedicating one neuron (or one dimension) to one concept, networks can store concepts in nearly orthogonal directions that span across many neurons at once.

The exponential scaling of near-orthogonal directions means that a modest number of neurons can, in principle, represent an enormous number of distinct concepts.
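A toy version of this idea can be sketched in a few lines. Below, 2000 hypothetical features each get a random direction in a 256-dimensional "activation" space, a few features are switched on at once, and each feature is read back with a dot product. The random directions are a stand-in for whatever directions a trained network would learn, and all sizes and names are assumptions made for this illustration. The readout relies on only a few features being active at the same time.

```python
import math
import random

rng = random.Random(0)
NUM_NEURONS = 256     # dimensionality of the activation vector
NUM_FEATURES = 2000   # far more features than neurons

def random_unit_vector(dim):
    """Random direction: Gaussian coordinates scaled to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# One nearly orthogonal direction per feature (stand-in for learned directions).
directions = [random_unit_vector(NUM_NEURONS) for _ in range(NUM_FEATURES)]

# Switch on a few features and superimpose their directions in one activation vector.
active = {17, 404, 1999}
activation = [sum(directions[f][k] for f in active) for k in range(NUM_NEURONS)]

# Read every feature back out with a dot product against its direction.
readout = [sum(a * b for a, b in zip(activation, d)) for d in directions]

# Active features read out near 1; inactive ones only pick up small interference.
ranked = sorted(range(NUM_FEATURES), key=readout.__getitem__, reverse=True)
recovered = set(ranked[:len(active)])
print(sorted(recovered), recovered == active)
```

Because the directions barely overlap, the three active features stand out clearly from the interference picked up by the other 1,997, even though there are only 256 neurons.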

This is part of what makes large neural networks so surprisingly powerful: the high-dimensional geometry of their activations gives them representational capacity that grows exponentially with their size.