Transformer Architecture Explained Simply

Last updated: June 2025 · AI Pentium Editorial Team

Quick Summary

The Transformer, introduced by Vaswani et al. in 2017, replaced sequential RNN processing with parallel self-attention — allowing every token to directly interact with every other token in the sequence. This architectural breakthrough enabled training on massive datasets and is the foundation of every major LLM: GPT-4, Claude, Gemini, LLaMA, Mistral, and beyond.

Why Transformers Changed Everything

Before 2017, sequence modeling was dominated by Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). These process sequences step by step — each token depends on the hidden state of the previous token. This created two fundamental problems: sequential computation (impossible to parallelize during training) and information bottleneck (early-sequence information degrades as it passes through many steps).
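
To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass in NumPy. The function name, weight names, and toy sizes are illustrative assumptions, not taken from any particular library. The point to notice is the Python loop: step t cannot begin until step t-1 has produced its hidden state.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla (Elman-style) RNN over x with shape (seq_len, d_in)."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state
    states = []
    for x_t in x:                        # strictly sequential: each step needs the previous h
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)              # (seq_len, d_hidden)

# Toy sizes and random weights, chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 6, 4, 8
x = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

states = rnn_forward(x, W_xh, W_hh, b_h)
```

Information from the first token reaches the last hidden state only after passing through every intermediate update, which is where the degradation described above comes from.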

The Transformer paper "Attention Is All You Need" (Vaswani et al., Google, 2017) replaced sequential processing with self-attention: every token attends to every other token simultaneously, in parallel, regardless of their distance. This enabled massive parallelization on GPUs and direct modeling of long-range dependencies — the two ingredients that made trillion-parameter models feasible.

The Building Blocks: Tokens and Embeddings

Input text is first tokenized into a sequence of token IDs. Each token ID is mapped to a dense vector (the token embedding) of dimension d_model (e.g. 768 for BERT-base, 12,288 for the 175B-parameter GPT-3). A positional encoding is added to each embedding to inject position information, since self-attention on its own has no notion of token order: permuting the input tokens simply permutes the outputs.
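
The lookup-and-add step can be sketched in a few lines of NumPy. The vocabulary size, table sizes, and token IDs below are made-up toy values, and the position table is a random stand-in for a learned one.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8        # toy sizes (assumed for illustration)

# In a real model both tables are learned; here they are random placeholders.
token_embedding = rng.normal(size=(vocab_size, d_model))
position_embedding = rng.normal(size=(max_len, d_model))

token_ids = np.array([5, 42, 7, 42])             # hypothetical tokenizer output
positions = np.arange(len(token_ids))

# Row lookup for each token ID, plus the vector for its position.
x = token_embedding[token_ids] + position_embedding[positions]
```

Token 42 appears twice with the same token embedding, but the two rows of x differ because their positions differ. That difference is the only way the attention layers can tell the two occurrences apart.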

The original Transformer used fixed sinusoidal positional encodings. Later models such as BERT and GPT-2 used learned absolute position embeddings, and most recent LLMs use relative position schemes like Rotary Position Embedding (RoPE), which generally cope better with sequence lengths beyond those seen during training.
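
As a rough sketch, the fixed sinusoidal scheme from the 2017 paper uses sin(pos / 10000^(2i/d_model)) on even dimensions and the matching cosine on odd dimensions. The function name is our own, and the code assumes d_model is even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) table of fixed encodings; d_model must be even."""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
```

Each pair of dimensions oscillates at a different frequency, so every position gets a distinct pattern without any learned parameters.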

The Self-Attention Mechanism

Self-attention is the core operation of the Transformer. For each token in the sequence, it computes three vectors from the embedding: a Query (Q), a Key (K), and a Value (V) — each produced by a separate linear projection matrix.

The attention score between token i and token j is the dot product of Q_i and K_j, divided by √d_k. Without this scaling, dot products grow with the dimension d_k and push softmax into saturated regions where gradients become vanishingly small. The scores for token i are then normalized with softmax over all j to give attention weights, and the output for token i is the weighted sum of all Value vectors:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V
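
The equation translates almost line for line into NumPy. This is a minimal single-head sketch with no masking or batching; the projection matrices are random placeholders and the sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max first for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # row i holds token i's attention weights
    return weights @ V, weights          # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8          # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
```

Row i of weights sums to 1 and says how much token i draws from each other token. The whole computation is a pair of matrix multiplications, which is why it parallelizes so well on GPUs.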

This single equation allows each token to "look at" every other token in the sequence and gather information weighted by relevance. It is the mechanism that lets models such as ChatGPT make use of context.

Multi-Head Attention (MHA)

A single attention head computes one set of Q/K/V projections. Multi-head attention runs h attention heads in parallel, each with its own Q, K, V weight matrices and operating in a lower-dimensional subspace (d_k = d_model / h). The outputs of all heads are concatenated and linearly projected back to d_model.
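
The split, attend, concatenate, project sequence can be sketched as follows. This is an unbatched, unmasked illustration with random weights; the function and argument names are our own and do not mirror any specific framework's API.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads                           # per-head dimension

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (num_heads, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                  # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4                   # toy sizes
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
```

Using one d_model × d_model matrix per projection and slicing it into heads is equivalent to giving each head its own smaller matrices, and it keeps the total cost close to that of a single full-width head.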

Why multiple heads? Each head can specialize in attending to different types of relationships: one head might track syntactic dependencies, another co-references, another long-range semantic relationships. The 175B-parameter GPT-3 has 96 attention heads per layer. Empirical analysis of attention patterns suggests that at least some heads learn distinct, interpretable behaviors.

The Feed-Forward Sublayer

After multi-head attention, each Transformer layer applies a position-wise feed-forward network (FFN): two linear transformations with a non-linear activation in between (ReLU in the original paper, GELU or similar in many later models). The inner dimension is typically 4× d_model. With ReLU, the FFN is:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

The FFN operates independently on each token position (no cross-token interaction). Research suggests FFN layers act as key-value memories, storing factual associations learned during pre-training. With a 4× inner dimension, the FFN holds roughly two thirds of each layer's weights (8·d_model² out of about 12·d_model², ignoring biases and embeddings).
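
A short NumPy sketch illustrates both the position-wise behavior and the parameter share. The weights are random, the sizes are toy values, and the parameter count assumes standard d_model × d_model attention projections and ignores biases.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (token) of x.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
d_ff = 4 * d_model                                   # the usual 4x inner dimension
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
out = ffn(x, W1, b1, W2, b2)
single = ffn(x[2:3], W1, b1, W2, b2)                 # token 2 processed on its own

# Weight counts per layer, biases ignored.
attn_params = 4 * d_model * d_model                  # W_q, W_k, W_v, W_o
ffn_params = 2 * d_model * d_ff                      # W1 and W2
ffn_share = ffn_params / (attn_params + ffn_params)
```

Here single matches out[2], confirming that no information crosses between positions inside the FFN, and ffn_share comes out to two thirds.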

Layer Normalization and Residual Connections

Each sublayer (attention and FFN) is wrapped with a residual connection (output = sublayer(x) + x) and layer normalization. The original paper applied LayerNorm after the residual addition (Post-LN). Modern LLMs (GPT-2 onward, LLaMA) normalize the sublayer input instead (Pre-LN), which improves training stability for very deep networks.
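
The difference between the two orderings is easiest to see side by side. This sketch omits LayerNorm's learned gain and bias for brevity, and the sublayer is a hypothetical stand-in (a single random linear map) rather than real attention or an FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original Transformer: add the residual, then normalize.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input, leave the residual path untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 8)) * 0.1

def sublayer(t):
    return t @ W                         # placeholder for attention or the FFN

post_out = post_ln_block(x, sublayer)
pre_out = pre_ln_block(x, sublayer)
```

In the Pre-LN form the input x reaches the output through a plain addition, so stacking many layers keeps an unnormalized identity path from the first layer to the last. That path is commonly given as the reason Pre-LN trains more stably at depth.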

Encoder vs Decoder: Two Architectural Variants

The original Transformer had both an encoder and a decoder. Most modern LLMs keep only one of the two, although encoder-decoder models are still used for some tasks:

| Variant | Attention Type | Pre-training Objective | Best For | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional (all tokens attend to all) | Masked LM | Classification, NER, embeddings | BERT, RoBERTa |
| Decoder-only | Causal (each token attends left only) | Next-token prediction | Generation, coding, chat | GPT-4, Claude, LLaMA, Mistral |
| Encoder-Decoder | Bidirectional encoder + causal decoder | Span corruption | Translation, summarization | T5, BART |
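
In code, the difference between bidirectional and causal attention usually comes down to a mask applied to the scores before softmax. The sketch below assumes random scores and a helper name of our own choosing.

```python
import numpy as np

def attention_weights(scores, causal):
    """scores: (seq_len, seq_len). Returns row-normalized attention weights."""
    if causal:
        n = scores.shape[-1]
        # True above the diagonal, i.e. where column j is a future position for row i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)   # masked scores become 0 after softmax
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

bidirectional = attention_weights(scores, causal=False)   # encoder-style
causal = attention_weights(scores, causal=True)           # decoder-style
```

With the causal mask, the first token can attend only to itself and every weight above the diagonal is exactly zero, which is what allows a decoder to be trained on next-token prediction without seeing the answer.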

Scaling Laws

A landmark 2020 paper from OpenAI ("Scaling Laws for Neural Language Models", Kaplan et al.) showed that LLM loss falls predictably as a power law in model size (parameters N), dataset size (tokens D), and compute budget (FLOPs C). The key insight: the three must be scaled together. Returns diminish quickly if either model size or dataset size is held fixed while the other grows, so a 10× bigger model trained on the same data makes less efficient use of compute than a model whose training data grows along with it.

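The Chinchilla scaling laws (Hoffmann et al., DeepMind, 2022) refined this: for a given compute budget, parameters and training tokens should be scaled in roughly equal proportion, which works out to about 20 training tokens per parameter. By that measure, earlier models such as GPT-3 were undertrained for their size. Later models deliberately train well past this ratio to get smaller models that are cheaper to serve: LLaMA-3 8B was trained on ~15T tokens and outperforms much larger earlier models on many benchmarks.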
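
For a back-of-the-envelope feel, the sketch below combines two widely used approximations, both assumptions here rather than exact results: training compute C ≈ 6·N·D FLOPs, and the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter. The function name is our own.

```python
import math

def compute_optimal_size(compute_flops, tokens_per_param=20.0):
    """Estimate (parameters, tokens) for a FLOP budget.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget corresponding to 70B parameters trained on 1.4T tokens (Chinchilla's reported setup).
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_size(budget)

# Tokens per parameter for LLaMA-3 8B at ~15T tokens, far beyond the rule of thumb.
llama3_ratio = 15e12 / 8e9
```

Plugging Chinchilla's own budget back in recovers about 70B parameters and 1.4T tokens, while the LLaMA-3 8B ratio lands near 1,900 tokens per parameter. The gap reflects a different goal: minimizing inference cost rather than training compute.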

Modern Transformer Innovations

Further Reading

Frequently Asked Questions

What is self-attention in the Transformer?

Self-attention allows each token to directly attend to every other token in the sequence. It computes query (Q), key (K), and value (V) vectors for each token via learned linear projections, then computes attention weights as softmax(QK^T/√d_k) and returns the weighted sum of values. This enables direct modeling of long-range dependencies, at a cost of O(n²) time and O(n²) memory in the sequence length n.

What is the difference between BERT and GPT?

BERT uses a bidirectional encoder: tokens attend to both left and right context, making it ideal for understanding tasks (classification, NER, question answering). GPT uses a causal decoder: tokens attend only to previous tokens, making it ideal for text generation, coding, and conversation. All major chat LLMs (GPT-4, Claude, Gemini, LLaMA) are decoder-only.

Why does the Transformer not use RNNs?

RNNs process tokens sequentially (not parallelizable) and suffer from vanishing gradients over long sequences. Transformers process all tokens in parallel via self-attention, enabling GPU-efficient training and direct long-range interactions. The tradeoff is O(n²) attention complexity vs O(n) for RNNs, but hardware parallelism makes Transformers faster in practice for modern sequence lengths.

What is positional encoding?

Self-attention has no built-in notion of token order, so positional information must be injected explicitly. The original Transformer used fixed sinusoidal encodings. Modern LLMs use Rotary Position Embedding (RoPE) or ALiBi, which encode relative positions and generally cope better with sequence lengths not seen in training.
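
To show what "encoding relative positions" means for RoPE, here is a simplified sketch that rotates consecutive pairs of dimensions by a position-dependent angle. The function name and the interleaved pairing are illustrative choices; production implementations often pair dimensions differently and operate on batched tensors.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate a single vector x (even length) according to its position."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # one frequency per dimension pair
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # 2D rotation of each (even, odd) pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (2 positions apart) at two different absolute locations.
score_near = rope(q, 3) @ rope(k, 1)
score_far = rope(q, 10) @ rope(k, 8)
```

The two scores agree up to floating-point error: after rotation, the query-key dot product depends only on the distance between the two positions, not on where they sit in the sequence.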

How many parameters does a Transformer have?

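Parameter count scales roughly with depth × width² (about 12 × layers × d_model² for a standard layer stack, excluding embeddings). GPT-2: 117M–1.5B. GPT-3: 175B. LLaMA-3 70B: 70B. GPT-4: undisclosed; unofficial estimates put it above 1T with a mixture-of-experts (MoE) design. Most parameters live in the attention projection matrices and FFN layers.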
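
The depth × width² rule can be checked with a quick estimate. The sketch assumes a standard layer (four d_model × d_model attention projections plus an FFN with a 4× inner dimension) and ignores biases and LayerNorm weights; the function name is our own. GPT-3's published configuration of 96 layers and d_model = 12,288 serves as the example.

```python
def approx_transformer_params(n_layers, d_model, vocab_size=0):
    """Rough weight count: 12 * d_model^2 per layer, plus an optional embedding table."""
    attention = 4 * d_model ** 2          # W_q, W_k, W_v, W_o
    ffn = 8 * d_model ** 2                # two matrices of shape d_model x 4*d_model
    return n_layers * (attention + ffn) + vocab_size * d_model

gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288)
```

The estimate comes to about 1.74 × 10¹¹, close to the reported 175B, with most of the remainder accounted for by the token embedding table.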
