Transformers and Attention¶

Before transformers, language models processed text one word at a time — slow, forgetful, and unable to scale. The 2017 paper "Attention Is All You Need" replaced recurrence with a single mechanism: self-attention. Every modern LLM — GPT-4o, Claude, Llama, Gemini — is a transformer.

Learning objectives¶

Understand why transformers replaced RNNs for language modeling
Explain self-attention, multi-head attention, and positional encoding
Map the data flow through a transformer block
Name the key architectural variants used in 2025 production models

Why transformers replaced RNNs¶

Recurrent Neural Networks (RNNs) processed tokens sequentially: word 1 → word 2 → word 3. Two problems:

Can't parallelize training — each step depends on the previous one
Vanishing gradient — information from early tokens fades over long sequences

Transformers process the entire sequence simultaneously. Token 1 and token 500 are computed at the same time. This is why you can train a 70B-parameter model on thousands of GPUs — the work is embarrassingly parallel.

The transformer block¶

A transformer model is a stack of identical layers. Each layer has two sub-components:

Input Tokens
    ↓
[Embedding + Positional Encoding]
    ↓
┌─────────────────────────────┐
│  Multi-Head Self-Attention  │  ← "Who should I pay attention to?"
│  + Add & LayerNorm          │
├─────────────────────────────┤
│  Feed-Forward Network       │  ← "What do I think about this?"
│  + Add & LayerNorm          │
└─────────────────────────────┘
    ↓
[Repeat N times — GPT-4 has ~96 layers]
    ↓
Linear + Softmax → Next Token Probabilities

Self-attention: the core idea¶

Self-attention lets every token ask: "Which other tokens in this sequence are most relevant to understanding me right now?"

It does this with three learned matrices applied to each token embedding:

Matrix	Role	Analogy
Q (Query)	What am I looking for?	A search query
K (Key)	What do I offer?	A document title
V (Value)	What do I actually contain?	The document body

The attention score between token i and token j:

score(i, j) = softmax( Q_i · K_j / √d_k )

d_k is the key dimension (scaling prevents exploding gradients in large models)
Softmax turns raw scores into probabilities that sum to 1
The output for token i is a weighted sum of all Value vectors

Example: In the sentence "The bank by the river flooded", the word "bank" attends strongly to "river" and "flooded" to resolve the ambiguity — financial institution or riverbank.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Core attention operation — same math as every transformer."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

# Toy example: 4 tokens, 8-dim embeddings
seq_len, d_model = 4, 8
Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

output, attn_weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")        # (4, 8)
print(f"Attention weights:\n{attn_weights}")   # 4×4 matrix

Multi-head attention¶

Running attention once gives one "perspective." Running it h times in parallel — each with different Q/K/V projections — lets the model capture multiple types of relationships simultaneously:

Head 1: syntactic dependencies ("subject → verb")
Head 2: coreference ("she" → "Alice")
Head 3: semantic similarity
Head N: whatever the model learned was useful

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        # Project and split into heads
        Q = self.W_q(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)

        # Attention per head
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        out = weights @ V

        # Concatenate heads and project
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)

# GPT-3 scale: 96 layers, 96 heads, 12288-dim embeddings
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)  # batch=2, seq=10, d_model=512
print(mha(x).shape)  # (2, 10, 512)

Positional encoding¶

Self-attention treats the sequence as a bag of tokens — it has no built-in sense of word order. Positional encoding adds order information to each token embedding.

Two strategies in production models:

Strategy	Used by	How it works
Sinusoidal (absolute)	Original transformer, BERT	Fixed sine/cosine patterns added to embeddings
RoPE (Rotary Position Embedding)	Llama 3, Mistral, GPT-NeoX	Rotates Q and K vectors — enables longer context extrapolation

RoPE is now the dominant approach because it generalizes better to sequences longer than those seen during training — crucial for the 128K–1M context windows of modern models.

The feed-forward network¶

After attention, each token passes through a two-layer MLP independently:

FFN(x) = GELU( x · W₁ + b₁ ) · W₂ + b₂

The hidden dimension is typically 4× the model dimension. In GPT-3 (12,288-dim), the FFN hidden layer has ~49,152 neurons. This is where most of the model's parameters live — and where factual knowledge is thought to be stored.

The "knowledge" question

Research suggests attention layers learn relationships (syntax, coreference) while FFN layers store facts (Paris is the capital of France). This is why fine-tuning often focuses on FFN weights.

2025 architectural variants¶

Production LLMs have evolved beyond the original transformer:

Innovation	What it does	Models using it
GQA (Grouped-Query Attention)	Multiple query heads share one K/V head — reduces memory during inference	Llama 3, Mistral, GPT-4o
MLA (Multi-Head Latent Attention)	Compresses K/V into a low-rank latent space — shrinks KV cache by 5–13×	DeepSeek-V3, DeepSeek-R1
FlashAttention-3	Reorders GPU memory accesses for 2–3× speedup — no change to output	Used in most production inference stacks
Mixture of Experts (MoE)	Only a subset of FFN layers ("experts") activate per token — sparse computation	GPT-4, Mixtral 8×7B, DeepSeek
SwiGLU activation	Replaces GELU in FFN — better loss curves	Llama, PaLM, Gemma

Why GQA matters for you

GQA directly affects inference cost. A model with GQA can serve longer contexts with less GPU memory — which is why Llama 3 70B can run on a single A100 where earlier 70B models could not.

Encoder-only vs decoder-only vs encoder-decoder¶

Architecture	Training objective	Best for	Examples
Encoder-only	Masked language modeling (predict masked tokens)	Classification, embeddings, search	BERT, RoBERTa
Decoder-only	Causal language modeling (predict next token)	Text generation, chat, reasoning	GPT-4o, Claude, Llama 3
Encoder-decoder	Seq2seq (encode input, decode output)	Translation, summarization	T5, BART, Flan-T5

All the models you'll use in this course — GPT-4o, Claude, Llama 3 — are decoder-only. The encoder is not needed when the model sees the full input before generating.

Common misconception

"ChatGPT uses BERT" — No. BERT is encoder-only and can't generate text. GPT-4o is decoder-only. They're fundamentally different architectures despite both being transformers.

Key numbers to know¶

Model	Layers	Heads	d_model	Parameters
GPT-2	12	12	768	117M
GPT-3	96	96	12,288	175B
Llama 3 8B	32	32	4,096	8B
Llama 3 70B	80	64	8,192	70B

Bigger is not always better — Llama 3 8B outperforms GPT-3 on most benchmarks despite being 20× smaller, thanks to better training data and RLHF.

Common mistakes¶

Attention is O(n²) in sequence length

The attention matrix is seq_len × seq_len. Doubling the context window quadruples the memory. This is why FlashAttention and GQA exist — and why you should profile your RAG chunk sizes carefully.

More heads ≠ better

Increasing the number of attention heads beyond the model's capacity wastes compute. The optimal head count depends on d_model and data. Don't tune this unless you're training from scratch.

What's next¶

Understanding attention is necessary but not sufficient — you also need to understand how the model reads your input. That starts with tokenization.

02-tokenization | 00-agenda