How ChatGPT Works: A Complete Technical Guide

Last updated: June 2025 · AI Pentium Editorial Team

Quick Summary

ChatGPT is a large language model (LLM) built on the Transformer architecture. It generates text by predicting the next token in a sequence. Its capabilities come from three training phases: (1) pre-training on hundreds of billions of tokens of internet text, (2) supervised fine-tuning on human instruction-response pairs, and (3) RLHF (Reinforcement Learning from Human Feedback) to align outputs with human preferences.

What is ChatGPT?

ChatGPT is a conversational AI assistant developed by OpenAI, first released in November 2022. It launched on GPT-3.5 and later added GPT-4, large language models trained on vast amounts of internet text, books, code, and scientific literature. OpenAI has not published parameter counts for either model; their predecessor GPT-3 has 175 billion parameters.

Unlike narrow AI systems designed for single tasks, ChatGPT can engage in open-ended conversation, write and debug code, explain complex concepts, translate languages, summarize documents, and reason through multi-step problems — all in natural language. GPT-4o (released 2024) extended this to images and audio as well.

Phase 1: Pre-Training — Learning Language from Text

The foundation of ChatGPT is a Transformer neural network, pre-trained on hundreds of billions of tokens of text. During pre-training, the model learns to predict the next token in a sequence given all previous tokens. This is called the language modeling objective.

Given the partial sentence "The Transformer architecture uses self-___", the model learns to assign high probability to "attention" and low probability to everything else. Across trillions of such predictions on diverse text, the model develops rich internal representations of language, facts, logic, code, and reasoning.
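To make the objective concrete, here is a minimal sketch of the loss for a single prediction. The four-word vocabulary and the logit values are made up for illustration; a real model scores its entire vocabulary at every position and averages the loss over a huge number of positions.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical toy vocabulary and model scores for "...uses self-___"
vocab = ["attention", "driving", "banana", "the"]
logits = [5.0, 1.0, -2.0, 0.5]

loss_good = next_token_loss(logits, vocab.index("attention"))
loss_bad = next_token_loss(logits, vocab.index("banana"))
print(f"loss if the true next token is 'attention': {loss_good:.3f}")
print(f"loss if the true next token is 'banana':    {loss_bad:.3f}")
```

The loss is small when the model already puts most of its probability on the correct token and large otherwise, which is the signal gradient descent uses to adjust the weights.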

OpenAI has not disclosed GPT-4's size. Unconfirmed third-party estimates put it at over 1 trillion parameters trained on roughly 13 trillion tokens. Pre-training at this scale is estimated to cost tens of millions of dollars in compute and to run for months on clusters of thousands of data-center GPUs such as NVIDIA A100s and H100s.

What is a Token?

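LLMs do not process raw characters or words. They operate on tokens, which are subword units produced by a tokenizer. OpenAI's models use Byte Pair Encoding (BPE), implemented in its open-source tiktoken library. The cl100k_base encoding used by GPT-3.5 Turbo and GPT-4 has a vocabulary of ~100,000 tokens; GPT-4o uses the larger o200k_base encoding with ~200,000.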

Common words like "the", "AI", and "model" are single tokens. Rarer or longer words are split into pieces: "unbelievable", for example, might become "un" + "believ" + "able" (the exact split depends on the tokenizer). Code punctuation such as brackets and semicolons tends to be one token per character. On average, 1 token ≈ 0.75 English words or roughly 4 characters.
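The sketch below shows the core idea of BPE training on a toy corpus: repeatedly find the most frequent adjacent pair of symbols and merge it into one new symbol. It is a simplified character-level illustration, not tiktoken's implementation; production tokenizers work on bytes and apply pre-tokenization rules first. The corpus and function names are invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=3)
print("learned merges:", merges)
print("segmented words:", list(words))
```

Frequent words end up as a single symbol after a few merges, while rarer words stay split into several pieces, which mirrors the behavior described above.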

Context window sizes (e.g. GPT-4 Turbo: 128K tokens; Claude 3.5: 200K tokens) and API pricing are all measured in tokens. Understanding tokens is fundamental to prompt engineering and cost optimization.
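For quick budgeting, the 0.75-words-per-token rule of thumb can be turned into a rough estimate. This is a sketch only: the helper names are hypothetical, the ratio applies to typical English text, and it assumes the prompt and the generated output share one context window. For exact counts, use the model's actual tokenizer.

```python
def estimate_tokens_from_words(word_count, words_per_token=0.75):
    """Rough token estimate from an English word count."""
    return round(word_count / words_per_token)

def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """Check whether a prompt plus the reserved output budget fits the window."""
    return prompt_tokens + max_output_tokens <= context_window

# A 90,000-word document against a 128K-token window, reserving 4K for the reply
doc_tokens = estimate_tokens_from_words(90_000)
print(doc_tokens, fits_in_context(doc_tokens, 4_000, 128_000))
```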

The Transformer Architecture

ChatGPT's backbone is the Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. The key innovation was self-attention: when processing any token, the model can directly attend to every other token in the sequence, weighting their relevance dynamically.

GPT models use the decoder-only variant of the Transformer: each token attends only to itself and earlier tokens (causal masking), so text is generated left to right. This is ideal for autoregressive generation. The model stacks many such layers: GPT-3 has 96, while GPT-4's architecture is undisclosed but widely reported to be a mixture-of-experts design.
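A minimal single-head sketch of causal self-attention in plain Python follows. To keep it short, it assumes the token vectors serve directly as queries, keys, and values; a real Transformer layer first applies learned projection matrices and runs many heads in parallel on matrix hardware.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only positions 0..i."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: score only the current token and the ones before it.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (illustrative values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, all_weights = causal_self_attention(x, x, x)
for i, w in enumerate(all_weights):
    print(f"token {i} attends over {len(w)} position(s):", [round(v, 3) for v in w])
```

The first token can only attend to itself, the second to two positions, and so on, which is exactly what causal masking enforces.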

Phase 2: Supervised Fine-Tuning (SFT)

After pre-training, the model can predict text but does not behave as a helpful assistant. In SFT, OpenAI's human contractors write example conversations — a user prompt followed by an ideal assistant response. The model is fine-tuned on these (prompt, response) pairs using cross-entropy loss.

SFT teaches the model the conversational format and basic instruction-following. Thousands of such examples are typically used. However, SFT alone does not capture the full nuance of human preferences — that requires reinforcement learning.

Phase 3: RLHF — Reinforcement Learning from Human Feedback

RLHF is the technique that transformed GPT from a text predictor into a genuinely helpful assistant. The process has three steps:

  1. Collect preference data: Human raters compare pairs of model responses to the same prompt and rank them by quality — helpfulness, accuracy, harmlessness, and honesty.
  2. Train a reward model (RM): A separate neural network is trained on the comparison data to predict which response humans prefer. It learns to assign a scalar reward score to any (prompt, response) pair.
  3. Fine-tune with PPO: The main LLM is fine-tuned using Proximal Policy Optimization (PPO), an RL algorithm. The reward model provides the training signal: responses scored highly by the RM are reinforced.
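Steps 2 and 3 each rest on a compact formula. The sketch below shows a common formulation of both, not OpenAI's actual training code: a pairwise loss that pushes the reward model to score the preferred response higher, and the PPO clipped objective that limits how far one update can move the policy. The numeric inputs are invented, and the clipping range of 0.2 is a typical default rather than a value from this article.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Reward-model loss for one comparison: -log(sigmoid(r_chosen - r_rejected))."""
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO surrogate for one sample (to be maximized).

    ratio: new_policy_prob / old_policy_prob for the sampled token.
    advantage: how much better the outcome was than expected.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# The reward model is penalized when it ranks the rejected response higher.
print(pairwise_reward_loss(2.0, 0.0))   # small loss: ranking agrees with raters
print(pairwise_reward_loss(0.0, 2.0))   # large loss: ranking disagrees

# Clipping caps the gain from moving the policy too far in one step.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
```

In practice the reward used for PPO usually also includes a penalty for drifting too far from the SFT model, which this sketch leaves out.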

RLHF substantially improves helpfulness and reduces harmful outputs. However, it can cause "reward hacking": the model learns to produce responses that score well with raters or the reward model without being truthful, which can contribute to hallucinations. Anthropic's Constitutional AI (used in Claude) takes a related approach in which the model critiques and revises its own outputs against a set of written principles.

Autoregressive Inference: How ChatGPT Generates Text

At inference time, ChatGPT runs the following loop for each response:

  1. Your message is tokenized into a sequence of integer token IDs.
  2. These IDs pass through the embedding layer (each token → a high-dimensional vector) and all Transformer decoder layers.
  3. The final layer applies a linear projection and softmax to produce a probability distribution over the full vocabulary.
  4. A token is sampled from this distribution, governed by temperature and optionally top-p (nucleus) sampling.
  5. The new token is appended to the sequence and steps 2–4 repeat.
  6. Generation stops when an end-of-sequence (EOS) token is generated or the context limit is reached.
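The loop above can be sketched in a few lines. Here `toy_model` is a hypothetical stand-in for the embedding layer and Transformer stack: it simply favors the token ID that follows the last one. Everything else mirrors the steps listed: compute logits, apply softmax, sample, append, and stop at EOS or at a length limit.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(token_ids, vocab_size):
    """Hypothetical model: returns logits that favor (last_id + 1) % vocab_size."""
    logits = [0.0] * vocab_size
    logits[(token_ids[-1] + 1) % vocab_size] = 5.0
    return logits

def generate(prompt_ids, vocab_size, eos_id, max_len, rng):
    """Autoregressive decoding: sample one token at a time until EOS or max_len."""
    ids = list(prompt_ids)
    while len(ids) < max_len:
        probs = softmax(toy_model(ids, vocab_size))           # steps 2-3
        next_id = rng.choices(range(vocab_size), weights=probs, k=1)[0]  # step 4
        ids.append(next_id)                                   # step 5
        if next_id == eos_id:                                 # step 6
            break
    return ids

out = generate(prompt_ids=[0], vocab_size=5, eos_id=4, max_len=20, rng=random.Random(0))
print(out)
```

Passing a seeded `random.Random` makes the sampling reproducible, which is handy when debugging a decoding loop.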

This is why LLMs are computationally expensive at scale: every new token requires a full forward pass through all model layers. Key optimizations include KV-cache (avoids recomputing attention keys/values for previous tokens), speculative decoding (a smaller model drafts multiple tokens; the large model verifies them in parallel), and quantization (reducing model weights from FP16 to INT8 or INT4).
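Of these optimizations, quantization is the easiest to show in isolation. The sketch below uses symmetric per-tensor INT8 quantization, one simple scheme chosen for illustration; production systems typically quantize per channel or per group and use more careful calibration. The weight values are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now needs 8 bits instead of 16, at the cost of a rounding error of at most half the scale per weight.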

Temperature and Top-p Sampling

Temperature (τ) rescales the logits before softmax: each logit is divided by τ. As τ approaches 0, the distribution collapses onto the highest-probability token; implementations treat τ=0 as greedy decoding, since dividing by zero is undefined. At τ=1, the model's unmodified probabilities are used. At τ>1, the distribution flattens, so output becomes more varied and creative but may lose coherence.
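A small sketch makes the effect visible. The logits are invented, and treating any temperature of 0 or below as greedy selection is a convention assumed here for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        # Assumed convention: fall back to greedy (all mass on the top token).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, and higher temperatures spread it across the alternatives.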

Top-p (nucleus) sampling selects only from the smallest set of tokens whose cumulative probability mass reaches the threshold p (e.g. p=0.9). This adaptively narrows or widens the candidate set depending on how concentrated the distribution is, cutting off the long tail of unlikely tokens while still leaving room for variety.
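As a sketch, the nucleus filter can be written as a sort, a running sum, and a renormalization. The probabilities are illustrative, and this version keeps tokens until the running total reaches p; libraries differ slightly in how they handle the boundary.

```python
def top_p_filter(probs, p):
    """Zero out everything outside the nucleus and renormalize the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.9))  # keeps the three most likely tokens
print(top_p_filter(probs, 0.5))  # keeps only the most likely token
```

Sampling then proceeds from the filtered distribution, so the least likely token in this example can never be chosen at p=0.9.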

Context Windows and Memory

By default, the model has no persistent memory between separate conversations. Everything it "knows" about your session is contained in the context window. Once a conversation exceeds this limit, older content has to be dropped or summarized; a common strategy is to discard the oldest tokens first.
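A minimal sketch of that drop-the-oldest strategy, working at the level of whole messages, is shown below. The `count_tokens` argument is a hypothetical callback; the example passes a crude word counter as a stand-in for a real tokenizer, and how any given product actually trims history is not something this sketch claims to reproduce.

```python
def truncate_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break                        # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))          # restore chronological order

def word_count(text):
    """Stand-in token counter for the example (real systems use the tokenizer)."""
    return len(text.split())

history = ["a b c", "d e", "f g h i"]
print(truncate_history(history, max_tokens=6, count_tokens=word_count))
```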

Model | Context (tokens) | ≈ Words
GPT-3.5 Turbo | 16K | 12K
GPT-4 Turbo | 128K | 96K
GPT-4o | 128K | 96K
Claude 3.5 Sonnet | 200K | 150K
Gemini 1.5 Pro | 1M | 750K

For applications needing long-term memory or grounding in external documents, Retrieval-Augmented Generation (RAG) is the standard approach.
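The RAG pattern has two parts: retrieve the most relevant documents for a query, then place them in the prompt. The sketch below uses bag-of-words cosine similarity purely to stay self-contained; real systems typically use dense embeddings and a vector index. The documents, function names, and prompt wording are all invented for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Assemble a grounded prompt from the retrieved context and the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The context window of GPT-4 Turbo is 128K tokens.",
    "PPO is a reinforcement learning algorithm.",
    "BPE splits rare words into subword units.",
]
print(build_prompt("What is the context window of GPT-4 Turbo?", docs))
```

The assembled prompt is then sent to the LLM, which answers from the supplied text rather than from memory alone.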

Why ChatGPT Hallucinates

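Hallucination, meaning confident but incorrect output, stems from several root causes: the training objective rewards plausible, preferred text rather than verified facts; the model has a training knowledge cutoff and no real-time information access by default; and it can confabulate blends of its training data.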

One of the most widely used mitigations is RAG (Retrieval-Augmented Generation), which grounds the LLM's response in retrieved, verified documents at inference time.

GPT Model Comparison

Feature | GPT-3.5 Turbo | GPT-4 Turbo | GPT-4o
Context window | 16K | 128K | 128K
Multimodal | Text only | Text + Vision | Text + Vision + Audio
Coding | Good | Excellent | Excellent
Speed | Fast | Medium | Fast
Price (input, per 1M tokens) | $0.50 | $10 | $2.50
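To see what these prices mean for a given workload, the input cost is just the token count times the per-million rate. The sketch below copies the input prices from the table above; it ignores output tokens, which are billed separately, and the prices themselves change over time, so treat the numbers as a snapshot.

```python
# Input prices in USD per million tokens, copied from the comparison table.
INPUT_PRICE_PER_M = {
    "GPT-3.5 Turbo": 0.50,
    "GPT-4 Turbo": 10.00,
    "GPT-4o": 2.50,
}

def input_cost_usd(model, input_tokens):
    """Cost of the input side of a request, in US dollars."""
    return INPUT_PRICE_PER_M[model] * input_tokens / 1_000_000

# Cost of sending one full 128K-token prompt to each 128K-context model
for model in ("GPT-4 Turbo", "GPT-4o"):
    print(model, round(input_cost_usd(model, 128_000), 4))
```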

Frequently Asked Questions

How does ChatGPT generate responses?

ChatGPT generates responses token by token via autoregressive decoding. It runs a full Transformer forward pass on the current sequence to produce a next-token probability distribution, samples from it (controlled by temperature), appends the new token, and repeats until an EOS token or length limit is reached.

What is a token in AI?

A token is the basic text unit an LLM processes, typically a word or subword fragment produced by BPE tokenization. GPT-4 uses the cl100k_base encoding from OpenAI's tiktoken library, with a vocabulary of ~100K tokens. On average 1 token ≈ 0.75 English words. Context limits and API costs are measured in tokens.

What is temperature in an LLM?

Temperature scales the model's output logits before sampling. Temperature=0 → greedy (always pick the top token). Temperature=1 → sample from raw probabilities. Higher values increase diversity and creativity; lower values increase focus and determinism.

Why does ChatGPT make mistakes?

ChatGPT hallucinates because its objective is to produce plausible, preferred text — not to verify facts. It has a training knowledge cutoff, no real-time information access, and can confabulate blends of training data. RAG and tool use are the primary mitigation strategies.

What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) is the training technique that aligns a pre-trained LLM to be helpful and safe. Human raters compare model outputs; a reward model is trained on those preferences; the LLM is then fine-tuned with PPO to maximize reward. RLHF or a closely related preference-based method is used in training ChatGPT, Claude, Gemini, and most major models.
