Paper: Attention Is All You Need — Vaswani et al., Google Brain, 2017
Link: arxiv.org/abs/1706.03762
One-line summary
Replace recurrence and convolution entirely with self-attention — and get a model that trains faster, parallelises better, and handles long-range dependencies more effectively than anything before it.
Why it still matters in 2026
Every major language model you work with today — GPT-4, Claude, Gemini, Llama, Mistral — is a transformer. The architecture introduced in this paper is not a stepping stone to something else. It is the thing.
Understanding it doesn’t mean you’ll retrain GPT. It means you stop treating these models as black boxes and start reasoning about why they behave the way they do: why context window size is a hard constraint, why certain prompting patterns work, why attention-based systems struggle with very long documents in specific ways, and what trade-offs you’re accepting when you choose one model over another.
For an applied engineer in 2026, this paper is vocabulary. You can’t have a serious architectural conversation without it.
Key ideas, explained simply
The problem with RNNs
Before transformers, sequence-to-sequence tasks (translation, summarisation, text generation) were dominated by RNNs — recurrent neural networks. The key limitation: RNNs process tokens sequentially. Token 1, then token 2, then token 3. To understand token 100, the model has to have correctly carried forward information through 99 prior steps.
This creates two problems:
- Training is slow — you can’t parallelise across a sequence
- Long-range dependencies degrade — information from early tokens gets diluted or lost by the time you reach later ones
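The sequential bottleneck is easy to see in code. A toy sketch (the `step` function is an accumulating stand-in for a learned state update, not a real RNN cell):

```python
def rnn_forward(tokens, h0, step):
    # Each hidden state depends on the previous one, so this loop
    # cannot be parallelised across the sequence.
    h = h0
    states = []
    for t in tokens:
        h = step(h, t)
        states.append(h)
    return states

# Toy step: just accumulate. State 3 depends on states 1 and 2.
states = rnn_forward([1, 2, 3], 0, lambda h, x: h + x)
# states == [1, 3, 6]
```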
The transformer’s answer: attention
The transformer throws out recurrence entirely. Instead, every token attends to every other token in the sequence simultaneously. To understand token 100, the model doesn’t need to traverse 99 steps — it can directly look at any other token in one operation.
This is self-attention: each token asks “which other tokens in this sequence are relevant to understanding me?” and gets a weighted answer.
How self-attention actually works
Each token is projected into three vectors:
- Query (Q) — “what am I looking for?”
- Key (K) — “what do I contain?”
- Value (V) — “what do I contribute if you attend to me?”
Attention scores are computed as dot products between queries and keys, scaled and softmaxed into weights, then used to produce a weighted sum of values. In matrix form:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
The √d_k scaling prevents the dot products from growing too large and pushing softmax into regions with vanishing gradients.
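The formula is short enough to implement directly. A minimal NumPy sketch of single-head scaled dot-product attention (the shapes and toy data are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) — one head over a single sequence.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of values

# Tiny example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)   # shape (4, 8)
```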
Multi-head attention
Rather than computing attention once, the paper runs it in parallel across h separate “heads” — each with its own Q, K, V projections. Each head can learn to attend to different types of relationships simultaneously: one head might track syntactic structure, another semantic similarity, another positional proximity.
The outputs are concatenated and projected back down. Because each head works in a reduced dimension (d_model/h), the total cost stays close to single-head attention over the full dimension: more representational flexibility without a proportional increase in parameters.
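The split, per-head attention, and concatenation can be sketched in NumPy (the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (seq_len, d_model); each W: (d_model, d_model); h heads of width d_k.
    seq_len, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into h heads: (h, seq_len, d_k)
    split = lambda M: M.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, seq, seq)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    weights = e / e.sum(-1, keepdims=True)
    heads = weights @ Vh                                # (h, seq, d_k)
    # Concatenate heads back to (seq_len, d_model), then mix with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, h, seq_len = 16, 4, 5
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, Wq, Wk, Wv, Wo, h)  # shape (5, 16)
```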
Positional encoding
Self-attention is order-agnostic by default — “the cat sat on the mat” and “the mat sat on the cat” would look identical to pure attention. To inject position information, the paper adds a positional encoding to each token embedding before any attention is computed. The original paper uses fixed sinusoidal encodings; later models often learned them instead, and most current LLMs use relative schemes such as RoPE.
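The sinusoidal scheme from the paper is only a few lines: even dimensions get sines, odd dimensions get cosines, at geometrically spaced frequencies.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    # Assumes an even d_model, as in the paper's formulation.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(10, 16)  # added to the (10, 16) token embeddings
```

Each position gets a unique, deterministic pattern, and nearby positions get similar patterns — which is what lets attention recover relative order.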
The encoder-decoder structure
The original paper was designed for translation (English → German/French). It uses:
- An encoder that reads the input and builds a rich contextual representation
- A decoder that generates the output sequence token by token, attending both to its own prior outputs and to the encoder’s representation
Most LLMs you use today are decoder-only — the encoder-decoder split is more common in tasks with a distinct input and output (summarisation, translation, structured extraction).
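One detail worth seeing concretely: the decoder keeps generation autoregressive by masking future positions in its self-attention, so token i can attend only to tokens 0..i. A sketch of that causal mask:

```python
import numpy as np

def causal_mask(seq_len):
    # Future positions get -inf so softmax assigns them zero weight.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

# Apply the mask to (here, uniform) attention scores, then softmax.
scores = np.zeros((4, 4)) + causal_mask(4)
e = np.exp(scores - scores.max(-1, keepdims=True))
weights = e / e.sum(-1, keepdims=True)
# Row 0 attends only to token 0; row 3 attends uniformly to tokens 0..3.
```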
Why parallelisation matters so much
With RNNs, training required sequential computation across the entire sequence. Transformers compute all attention weights in parallel — which means modern GPU/TPU hardware can be utilised far more efficiently. This was the practical unlock that made training at scale feasible.
What it means for enterprise AI
Context windows are a direct consequence of attention cost. Self-attention scales quadratically with sequence length — double the tokens, quadruple the compute. This is why context limits exist and why extending them is expensive. When you’re designing a RAG pipeline or a long-document processing system, you’re navigating a constraint that comes directly from this architecture.
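A back-of-envelope sketch of that quadratic cost (the numbers are purely illustrative, and real systems use kernels such as FlashAttention that avoid materialising the full score matrix):

```python
def attn_scores_bytes(seq_len, n_heads, bytes_per=2):
    # Memory for the raw attention score matrices of one layer,
    # assuming fp16 (2 bytes) and no attention optimisations.
    return seq_len ** 2 * n_heads * bytes_per

small = attn_scores_bytes(4_096, 32)
big = attn_scores_bytes(8_192, 32)
# Double the tokens, quadruple the score memory:
assert big == 4 * small
```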
Attention patterns are inspectable. Unlike the hidden states of RNNs, attention weights are explicit. This is what makes attention-based interpretability research possible — and it’s why some enterprise governance frameworks can at least gesture at explaining model behaviour in ways that weren’t possible before.
The encoder-decoder split maps to different task types. If you’re choosing between model families for a specific task — extraction, generation, classification — understanding what architecture they use and why helps you make better decisions. A decoder-only model is not always the right tool.
Positional encoding is where a lot of long-context limitations originate. Sinusoidal encodings don’t generalise well beyond the sequence lengths seen in training. This is why techniques like RoPE (Rotary Position Embedding) and ALiBi exist — they’re attempts to solve a problem that’s baked into the original design.
My take
This paper is remarkable not because it introduced attention — attention mechanisms existed before — but because it had the conviction to eliminate everything else. No recurrence. No convolution. Just attention and feed-forward layers. That minimalism is what made it generalisable.
What strikes me most re-reading it in 2026 is how many of the current limitations in production LLMs trace directly back to decisions made here. Context window costs, positional generalisation problems, the encoder/decoder architecture divide — these are not bugs that got introduced later. They are the original design, and they were made before anyone knew this architecture would underpin systems at the scale we see today.
The authors were solving machine translation. They got something much larger.
One thing worth being honest about: for most applied AI engineering work, you don’t need to implement this. You need to understand it well enough to reason about the systems built on top of it. There’s a version of reading this paper where you get lost in the notation and walk away feeling like you learned nothing useful. The goal is the opposite — build the mental model, then use it to make better decisions in the work you’re actually doing.
Week 2 coming next Sunday. All paper reviews are tagged paper-review.