Q: What is positional encoding in the Transformer?

Self-attention on its own has no built-in sense of token order (permuting the input tokens simply permutes the outputs), so position information must be injected explicitly. The original Transformer uses fixed sinusoidal positional encodings added to the token embeddings. BERT and GPT-2 use learned absolute positional embeddings, which cannot represent positions beyond the trained maximum length. Many recent models instead use relative schemes such as Rotary Position Embedding (RoPE, used in LLaMA and Mistral), which tend to generalize better to sequences longer than those seen during training.


Transformer Architecture Explained Simply

Q: What is self-attention in the Transformer?

Self-attention is a mechanism that allows each token in a sequence to directly attend to every other token, computing a weighted sum of value vectors where weights are determined by the dot-product similarity between query and key vectors. This allows the model to dynamically focus on the most relevant parts of the input for each token, regardless of their distance in the sequence — solving the long-range dependency problem that RNNs struggled with.

Q: What is the difference between BERT and GPT?

BERT uses a bidirectional encoder-only Transformer: each token attends to all other tokens (both left and right context). BERT is pre-trained with masked language modeling (predict masked tokens) and is best for classification, NER, and question answering tasks. GPT uses a causal decoder-only Transformer: each token only attends to previous tokens (left context only). GPT is pre-trained with next-token prediction and is best for text generation, coding, and conversational tasks.

Q: Why does the Transformer not use RNNs?

RNNs process sequences step by step, which means: (1) they cannot be parallelized during training, (2) information from early tokens must pass through many sequential steps before reaching later tokens (vanishing gradient / information bottleneck). Transformers replace sequential processing with parallel self-attention, allowing all token pairs to interact directly in O(1) steps regardless of distance. This makes training far faster and enables learning of much longer-range dependencies.

Q: How many parameters does a Transformer have?

Parameter count scales with model depth (number of layers), width (hidden dimension size), and vocabulary size. GPT-2: 117M to 1.5B parameters. GPT-3: 175B parameters. LLaMA 3: 8B and 70B (LLaMA 3.1 adds a 405B model). GPT-4: undisclosed (unconfirmed estimates put it around 1T in a mixture-of-experts design). Claude 3.5 Sonnet: undisclosed. Parameters are primarily in the attention weight matrices (Q, K, V, O projections) and the feed-forward network sublayers.

Last updated: June 2025 · AI Pentium Editorial Team


Quick Summary

The Transformer, introduced by Vaswani et al. in 2017, replaced sequential RNN processing with parallel self-attention — allowing every token to directly interact with every other token in the sequence. This architectural breakthrough enabled training on massive datasets and is the foundation of every major LLM: GPT-4, Claude, Gemini, LLaMA, Mistral, and beyond.

Why Transformers Changed Everything

Before 2017, sequence modeling was dominated by Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). These process sequences step by step — each token depends on the hidden state of the previous token. This created two fundamental problems: sequential computation (impossible to parallelize during training) and information bottleneck (early-sequence information degrades as it passes through many steps).

The Transformer paper "Attention Is All You Need" (Vaswani et al., Google, 2017) replaced sequential processing with self-attention: every token attends to every other token simultaneously, in parallel, regardless of their distance. This enabled massive parallelization on GPUs and direct modeling of long-range dependencies — the two ingredients that made trillion-parameter models feasible.

The Building Blocks: Tokens and Embeddings

Input text is first tokenized into a sequence of token IDs. Each token ID is mapped to a dense vector (the token embedding) of dimension d_model (e.g. 768 for BERT-base, 12,288 for the 175B GPT-3; GPT-4's dimensions are undisclosed). A positional encoding is added to each embedding to inject position information, since self-attention otherwise has no notion of token order.

The original Transformer used fixed sinusoidal positional encodings. BERT and GPT-2 switched to learned absolute position embeddings, which are limited to the maximum length seen in training. Most recent LLMs use relative position schemes like Rotary Position Embedding (RoPE), which tend to extrapolate better to sequence lengths beyond those seen during training.
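The fixed sinusoidal scheme can be sketched in a few lines. This is a minimal illustration assuming the formulation from the original paper (sine on even dimensions, cosine on odd dimensions, base 10000); the function name `sinusoidal_encoding` is made up for this example and the code uses plain Python lists rather than a tensor library.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model)
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(seq_len=4, d_model=8)
# Position 0 gives sin(0) = 0 on even dimensions and cos(0) = 1 on odd ones.
print(pe[0])
```

In a model, row `pos` of this table would be added element-wise to the embedding of the token at position `pos`.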

The Self-Attention Mechanism

Self-attention is the core operation of the Transformer. For each token in the sequence, it computes three vectors from the embedding: a Query (Q), a Key (K), and a Value (V) — each produced by a separate linear projection matrix.

The attention score between token i and token j is the dot product of Q_i and K_j divided by √d_k. Without this scaling, dot products grow with the dimension d_k and push the softmax into saturated regions where gradients become vanishingly small. A softmax over j turns the scores into attention weights, and the output for token i is the weighted sum of all Value vectors:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

This single equation allows each token to "look at" every other token in the sequence and gather information weighted by relevance. It is the core mechanism behind the contextual behavior of models like ChatGPT.
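The equation above can be written out directly. The following is a minimal sketch in plain Python, assuming Q, K, and V are already computed and given as lists of row vectors; the toy values are invented for illustration, and a real implementation would use batched matrix operations.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# Toy example: each query matches one key more strongly than the other.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
print(out)  # token 0 leans toward V[0], token 1 toward V[1]
```

Because the weights in each row sum to 1, every output row is a convex combination of the value vectors.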

Multi-Head Attention (MHA)

A single attention head computes one set of Q/K/V projections. Multi-head attention runs h attention heads in parallel, each with its own Q, K, V weight matrices and operating in a lower-dimensional subspace (d_k = d_model / h). The outputs of all heads are concatenated and linearly projected back to d_model.

Why multiple heads? Each head can specialize in attending to different types of relationships: one head might track syntactic dependencies, another co-references, another long-range semantic relationships. The 175B GPT-3 has 96 attention heads per layer. Empirical analysis of attention patterns shows that some heads learn recognizable, interpretable behaviors, although many heads are harder to characterize.
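To make the split-attend-concatenate flow concrete, here is a small sketch of multi-head attention in plain Python. The helper names (`matmul`, `random_matrix`, `multi_head_attention`) and the random weights are assumptions for illustration only; trained models learn these matrices and run all heads as one batched operation.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    # Each head projects X into its own d_k-dimensional subspace and attends there.
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs along the feature axis, then project.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
n_tokens, d_model, h = 3, 8, 2
d_k = d_model // h  # 4 dimensions per head
X = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(d_model, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))  # 3 8
```

The output has the same shape as the input (one d_model-sized vector per token), which is what lets attention layers be stacked.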

The Feed-Forward Sublayer

After multi-head attention, each Transformer layer applies a position-wise feed-forward network (FFN): two linear transformations with a non-linear activation (ReLU or GELU) in between. The inner dimension is typically 4× d_model:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

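The FFN operates independently on each token position (no cross-token interaction). Research suggests FFN layers act as key-value memories, storing factual associations learned during pre-training. With a 4× inner dimension, the FFN holds about 8·d_model² weights per layer against 4·d_model² for the attention projections, so it typically contains ~2/3 of a Transformer's per-layer parameters (embeddings excluded).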
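The FFN formula above can be sketched as follows, using ReLU as written in the equation. The tiny weight matrices are invented for illustration and do not follow the 4× rule; the point is only that each token is transformed with the same weights and without reference to its neighbours.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single token vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Toy weights: d_model = 2, inner dimension = 3 (hypothetical values).
W1 = [[1.0, -1.0, 0.5],
      [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [2.0, -1.0]]
# Position-wise: the same function is applied to every token separately.
outputs = [ffn(x, W1, b1, W2, b2) for x in tokens]
print(outputs)  # [[1.0, 1.0], [3.5, 1.5]]
```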

Layer Normalization and Residual Connections

Each sublayer (attention and FFN) is wrapped with a residual connection (output = sublayer(x) + x) and layer normalization. The original paper applied LayerNorm after the residual addition (Post-LN): LayerNorm(x + sublayer(x)). Modern LLMs (GPT-2 onward, LLaMA) normalize the sublayer input instead (Pre-LN): x + sublayer(LayerNorm(x)), which improves training stability for very deep networks.
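The difference between the two orderings is easiest to see side by side. This sketch assumes a simplified layer norm without the learned gain and bias, and uses a hypothetical stand-in function `double` in place of a real attention or FFN sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    # Simplified: normalizes to zero mean and unit variance, no learned scale/shift.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln(x, sublayer):
    # Original Transformer: normalize after adding the residual.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path carries x unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def double(v):
    # Hypothetical stand-in for an attention or FFN sublayer.
    return [2.0 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln(x, double)
pre = pre_ln(x, double)
print(post)
print(pre)
```

In the Pre-LN variant the input `x` reaches the output through a pure identity path, which is one common explanation for its better behaviour in deep stacks.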

Encoder vs Decoder: Two Architectural Variants

The original Transformer had both an encoder and a decoder. Later models often keep only one half, and most current chat LLMs are decoder-only, though all three variants remain in use:

Variant	Attention Type	Pre-training Objective	Best For	Examples
Encoder-only	Bidirectional (all tokens attend to all)	Masked LM	Classification, NER, embeddings	BERT, RoBERTa
Decoder-only	Causal (each token attends left only)	Next-token prediction	Generation, coding, chat	GPT-4, Claude, LLaMA, Mistral
Encoder-Decoder	Bidirectional encoder + causal decoder	Span corruption (T5), denoising (BART)	Translation, summarization	T5, BART
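The "Attention Type" column in the table comes down to which positions each token is allowed to see. Below is a minimal sketch of bidirectional and causal masks applied before the softmax; the function names are illustrative, and uniform scores are used so that only the effect of the mask is visible.

```python
import math

def attention_mask(n, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(not causal) or j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 3
uniform_scores = [0.0] * n
bidirectional = [masked_softmax(uniform_scores, row)
                 for row in attention_mask(n, causal=False)]
causal = [masked_softmax(uniform_scores, row)
          for row in attention_mask(n, causal=True)]

for row in causal:
    print(row)
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# then a final row with weight 1/3 on each position
```

With the causal mask, token i spreads its attention over itself and earlier tokens only, which is what makes next-token prediction possible without leaking the answer.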

Scaling Laws

A landmark 2020 paper from OpenAI ("Scaling Laws for Neural Language Models", Kaplan et al.) showed that LLM loss improves predictably as a power law with model size (parameters N), dataset size (tokens D), and compute budget (FLOPs C). The key insight: the three must be scaled together, because holding either model size or data fixed while growing the other runs into diminishing returns. Kaplan et al. concluded that, for a fixed compute budget, most of the increase should go into model size, a recommendation that later work revised.

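The Chinchilla scaling laws (Hoffmann et al., DeepMind, 2022) refined this: for a given compute budget, model parameters and training tokens should be scaled in roughly equal proportion, which works out to about 20 training tokens per parameter. Under this recipe, Chinchilla (70B parameters, 1.4T tokens) outperformed the much larger Gopher (280B). Later models such as LLaMA-3 (8B parameters on ~15T tokens) train far beyond that ratio, trading extra training compute for a smaller model that is cheaper to serve, and outperform earlier much-larger models on many benchmarks.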
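A rough worked example helps put numbers on this. The sketch below assumes two common approximations that are not stated in the text above: training compute C ≈ 6·N·D FLOPs, and the figure usually quoted for Chinchilla of about 20 training tokens per parameter (with parameters and tokens growing in equal proportion as compute increases). Both are rules of thumb, not exact laws.

```python
import math

TOKENS_PER_PARAM = 20  # approximate Chinchilla rule of thumb (assumption)

def compute_optimal(flops):
    """Split a FLOP budget into parameters N and tokens D, assuming C = 6*N*D and D = 20*N."""
    n_params = math.sqrt(flops / (6 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Budget chosen to match a 70B-parameter model trained on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal(budget)
print(f"{n:.2e} params, {d:.2e} tokens")  # roughly 7e10 params and 1.4e12 tokens

# For comparison, the tokens-per-parameter ratio of an 8B model on ~15T tokens:
llama_ratio = 15e12 / 8e9
print(llama_ratio)  # 1875.0
```

The second number shows how far a small, heavily trained model sits from the compute-optimal ratio: the extra training cost is paid once, while the smaller size lowers the cost of every later inference call.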

Modern Transformer Innovations

Grouped Query Attention (GQA): Multiple query heads share key/value heads, reducing KV-cache memory and inference cost. Used in LLaMA-3, Mistral.
Sliding Window Attention: Tokens attend only to a local window of w tokens, so cost grows as O(n·w), which is linear in sequence length for a fixed window. Used in Mistral 7B and, alongside state-space layers, in some Mamba hybrid models.
Mixture of Experts (MoE): The FFN is replaced by multiple "expert" FFNs; a router selects which experts activate for each token. This dramatically increases model capacity without a matching increase in per-token compute. Used in Mixtral-8x7B and reportedly in GPT-4 (unconfirmed).
Flash Attention: An IO-aware exact attention implementation that avoids materializing the large attention matrix in HBM, yielding 2–4× speedups and enabling longer contexts.
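The memory saving from Grouped Query Attention in the list above can be estimated with simple arithmetic. The sketch below assumes a configuration in the style of an 8B-class model (32 layers, 32 query heads, head dimension 128, 16-bit values, 8,192-token context); these numbers are illustrative assumptions, and the function name is made up for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Two cached tensors (K and V) per layer, each n_kv_heads x seq_len x head_dim.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 2 ** 30

# Standard multi-head attention: one K/V head per query head (32 of each).
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
# Grouped query attention: 32 query heads share 8 K/V heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

print(mha / GIB, gqa / GIB)  # 4.0 1.0
```

The cache shrinks by the ratio of query heads to key/value heads (4× here) per sequence, which matters most when serving many long sequences at once.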

Frequently Asked Questions

What is self-attention in the Transformer?

Self-attention allows each token to directly attend to every other token in the sequence. It computes query (Q), key (K), and value (V) vectors for each token via learned linear projections, then computes attention weights as softmax(QK^T/√d_k) and returns the weighted sum of values. This enables direct modeling of long-range dependencies in O(n²) time and O(n²) memory.

What is the difference between BERT and GPT?

BERT uses a bidirectional encoder: tokens attend to both left and right context, making it ideal for understanding tasks (classification, NER, question answering). GPT uses a causal decoder: tokens attend only to previous tokens, making it ideal for text generation, coding, and conversation. All major chat LLMs (GPT-4, Claude, Gemini, LLaMA) are decoder-only.

Why does the Transformer not use RNNs?

RNNs process tokens sequentially (not parallelizable) and suffer from vanishing gradients over long sequences. Transformers process all tokens in parallel via self-attention, enabling GPU-efficient training and direct long-range interactions. The tradeoff is O(n²) attention complexity vs O(n) for RNNs, but hardware parallelism makes Transformers faster in practice for modern sequence lengths.

What is positional encoding?

Since self-attention has no built-in notion of token order (permuting the inputs simply permutes the outputs), positional information must be injected explicitly. The original Transformer uses fixed sinusoidal encodings. Modern LLMs use Rotary Position Embedding (RoPE) or ALiBi, which encode relative positions and extrapolate better to unseen sequence lengths.

How many parameters does a Transformer have?

Parameters scale with depth × width². GPT-2: 117M–1.5B. GPT-3: 175B. LLaMA-3: 8B–70B. GPT-4: undisclosed (~1T MoE according to unconfirmed estimates). Most parameters live in the attention projection matrices and FFN layers.
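The depth × width² rule can be checked with a back-of-the-envelope count. This sketch assumes a standard layer with four d_model × d_model attention projections and a 4× FFN, and ignores embeddings, biases, and normalization weights; the 96-layer, d_model = 12,288 configuration is the published shape of the 175B GPT-3.

```python
def layer_params(d_model, ffn_mult=4):
    attn = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    ffn = 2 * ffn_mult * d_model * d_model  # W_1 and W_2
    return attn, ffn

def approx_params(n_layers, d_model):
    attn, ffn = layer_params(d_model)
    return n_layers * (attn + ffn)  # 12 * n_layers * d_model^2 with the defaults

attn, ffn = layer_params(12288)
ffn_share = ffn / (attn + ffn)      # the ~2/3 FFN share mentioned earlier
total = approx_params(n_layers=96, d_model=12288)

print(round(ffn_share, 3))  # 0.667
print(f"{total:,}")         # 173,946,175,488
```

The estimate lands close to the quoted 175B; the remainder is mostly token and position embeddings, which this count leaves out.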
