What is RAG (Retrieval-Augmented Generation)?
Quick Summary
RAG (Retrieval-Augmented Generation) addresses a central limitation of LLMs: reliance on static training data. A RAG system retrieves relevant document chunks from an external knowledge base at query time and injects them into the LLM's context alongside the user's question. The LLM can then cite real sources and stay current without retraining.
Why LLMs Need RAG
Large language models store knowledge in their billions of parameters — "parametric memory." This creates fundamental limitations:
- Knowledge cutoff: Training data has a fixed end date. For example, the original GPT-4 had a September 2021 cutoff and GPT-4 Turbo launched with an April 2023 cutoff; Claude 3.5 Sonnet's cutoff is April 2024. The model cannot know about newer events, papers, or product releases.
- Hallucination: The model generates plausible-sounding text even when it doesn't "know" the answer. It cannot reliably distinguish what it knows from what it's confabulating.
- Private data: The model was not trained on your company's internal documents, codebase, or customer data — and you cannot retrain it for each update.
- Scalability: Fine-tuning a 70B-parameter model costs thousands of dollars in GPU time and must be repeated for every knowledge update.
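RAG addresses all four problems by separating knowledge storage (a document store you control) from reasoning (the LLM). Hallucination is reduced rather than eliminated, because the model can still misread or ignore the retrieved context.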
The RAG Pipeline: Step by Step
Step 1: Document Ingestion and Chunking
Source documents (PDFs, web pages, database records, code files) are loaded and split into smaller chunks. Chunk size is a critical hyperparameter. Typical values:
- 256–512 tokens per chunk: good balance of specificity and context for most text
- 32–64 token overlap between chunks: preserves continuity across chunk boundaries
Advanced chunking strategies: sentence-level chunks (split at sentence boundaries), recursive character splitters (split at paragraphs → sentences → words as fallback), and semantic chunking (use embedding similarity to detect topic shifts and split there).
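As a rough illustration of fixed-size chunking with overlap, the sketch below slides a window over a token list. The function name `chunk_tokens` is hypothetical, and the integer-tagged strings stand in for the output of a real tokenizer, which is an assumption made to keep the example self-contained.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Stand-in for tokenizer output (assumption): 600 distinct placeholder tokens.
tokens = [f"t{i}" for i in range(600)]
chunks = chunk_tokens(tokens, chunk_size=256, overlap=32)
print([len(c) for c in chunks])
```

The last chunk is usually shorter than the others. Whether to keep it, pad it, or merge it into the previous chunk is a design choice that depends on the embedding model and the documents.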
Step 2: Embedding
Each chunk is passed through an embedding model — a neural network that converts text into a dense vector (typically 768–3072 dimensions) that captures semantic meaning. Text with similar meaning has nearby vectors in this high-dimensional space.
Popular embedding models:
| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/M tokens | General English, production |
| OpenAI text-embedding-3-large | 3072 | $0.13/M tokens | High-accuracy retrieval |
| BAAI/bge-m3 (open source) | 1024 | Free | Multilingual, self-hosted |
| nomic-embed-text (open source) | 768 | Free | English, competitive with OpenAI small |
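To make "nearby vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are invented for illustration; a real embedding model would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up vectors (assumption): two related texts and one unrelated text.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar direction
print(cosine_similarity(cat, invoice))  # close to 0: nearly orthogonal
```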
Step 3: Vector Database Storage
Chunk embeddings are stored in a vector database that supports fast approximate nearest-neighbor (ANN) search. Each entry stores: the raw chunk text, its embedding vector, and metadata (source document, page number, timestamp).
Popular vector databases:
- Pinecone: Managed cloud service, simplest to get started, strong ANN performance
- Weaviate: Open-source, supports hybrid search (dense + BM25 keyword)
- Qdrant: Rust-based open-source, fast, good filtering support
- Milvus: Open-source, highly scalable, used at enterprise scale
- pgvector: PostgreSQL extension — zero new infrastructure if you already use Postgres
- FAISS: In-memory library (not a full database), ideal for prototyping
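The shape of a vector store entry (text, vector, metadata) can be sketched with a tiny in-memory class. This is not how any of the databases above are implemented: it uses exact brute-force search instead of ANN, and the class name `TinyVectorStore`, the file names, and the vectors are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entry:
    text: str                      # raw chunk text
    vector: list                   # embedding of the chunk
    metadata: dict = field(default_factory=dict)  # source, page, timestamp, ...


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class TinyVectorStore:
    """Exact nearest-neighbor search over a list; a stand-in for a real ANN index."""

    def __init__(self):
        self.entries = []

    def add(self, text, vector, **metadata):
        self.entries.append(Entry(text, vector, metadata))

    def search(self, query_vector, k=3):
        scored = [(_cosine(query_vector, e.vector), e) for e in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add("Refunds are issued within 14 days.", [0.9, 0.1, 0.0], source="policy.pdf", page=3)
store.add("Reset your password from the login page.", [0.1, 0.9, 0.1], source="help.html", page=1)
store.add("Our office is closed on public holidays.", [0.0, 0.2, 0.9], source="hr.pdf", page=7)

for score, entry in store.search([0.8, 0.2, 0.0], k=2):
    print(round(score, 3), entry.text, entry.metadata)
```

Brute force scans every entry, so it only suits small collections. ANN indexes trade a little accuracy for much faster search at scale.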
Step 4: Query-Time Retrieval
When a user submits a query:
- The query is embedded using the same model used for document embedding.
- The vector database performs an ANN search to find the k most similar chunk embeddings (typically k=3–10).
- The retrieved chunks are ranked by similarity score and optionally re-ranked by a cross-encoder reranker for improved precision.
Hybrid search (combining dense semantic search with BM25 keyword matching) generally outperforms either alone, especially for queries containing specific technical terms or proper nouns.
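One common way to merge a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The article does not name a fusion method, so treat this as one possible choice rather than the way hybrid search must be done. The chunk IDs and the `smoothing` value of 60 (a conventional default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, smoothing=60):
    """Fuse several ranked ID lists; each appearance adds 1 / (smoothing + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (smoothing + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for the same query from two retrievers.
dense_ranking = ["c3", "c1", "c7"]    # semantic (embedding) search
keyword_ranking = ["c1", "c9", "c3"]  # BM25 keyword search

print(reciprocal_rank_fusion([dense_ranking, keyword_ranking]))
# → ['c1', 'c3', 'c9', 'c7']
```

RRF only uses ranks, so it avoids having to normalize cosine scores and BM25 scores onto a common scale.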
Step 5: Context Injection and LLM Generation
The top-k retrieved chunks are inserted into the LLM's prompt alongside the user query, typically in a template like:
You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say you don't know.
Context:
[Chunk 1 text]
[Chunk 2 text]
[Chunk 3 text]
Question: {user_query}
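The LLM generates an answer grounded in the retrieved context. Optionally, it cites the source documents. If no relevant chunks are found (similarity below a threshold), the system can return "I don't know" or fall back to the model's parametric knowledge. The fallback needs a different prompt, since the template above restricts the model to the provided context, and its answers lose source grounding.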
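The template and the threshold check can be sketched as a single function. The name `build_prompt`, the `min_score` value of 0.3, and the numbered `[1]`, `[2]` chunk labels are assumptions for illustration; a sensible threshold depends on the embedding model and should be tuned on real queries.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {user_query}"""


def build_prompt(user_query, scored_chunks, min_score=0.3, top_k=3):
    """Fill the template with the best chunks, or return None if none are relevant.

    scored_chunks: list of (similarity_score, chunk_text) pairs.
    """
    relevant = [(score, text) for score, text in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    if not relevant:
        return None  # caller decides: reply "I don't know" or use a fallback prompt
    context = "\n\n".join(
        f"[{i}] {text}" for i, (_, text) in enumerate(relevant[:top_k], start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, user_query=user_query)


chunks = [
    (0.82, "Refunds are issued within 14 days."),
    (0.12, "Office hours are 9 to 5."),
    (0.55, "Refund requests need an order number."),
]
print(build_prompt("How long do refunds take?", chunks))
```

Numbering the chunks gives the model something to cite, which makes source attribution easier to check afterwards.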
RAG vs Fine-Tuning
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Add documents any time | Requires retraining |
| Hallucination | Reduced (grounded in sources) | Still present |
| Cost | Low (no GPU training) | High (GPU-hours to days) |
| Source attribution | Easy (cite retrieved chunks) | Difficult |
| Teach new skills/style | No | Yes |
| Privacy | Data stays in your store | Data used in training |
Many production systems combine both: fine-tuned models for tone, format, and domain-specific skills, plus RAG for factual grounding and up-to-date knowledge.
Advanced RAG Patterns
- HyDE (Hypothetical Document Embeddings): Have an LLM generate a hypothetical answer to the query, embed that answer, and use it for retrieval. This helps with complex questions, where a short query embeds poorly against much longer document chunks.
- Multi-query retrieval: Generate multiple paraphrases of the query with an LLM, retrieve for all of them, deduplicate results. Increases recall for ambiguous queries.
- Parent-child chunking: Store small child chunks for precise retrieval but return their larger parent chunk as context, balancing retrieval precision with generation context richness.
- Graph RAG: Build a knowledge graph from documents; retrieve based on entity relationships in addition to semantic similarity. Better for multi-hop reasoning.
- Reranking: After initial ANN retrieval, apply a cross-encoder reranker (e.g. Cohere Rerank, bge-reranker) to reorder chunks by relevance to the specific query.
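Of the patterns above, parent-child chunking is the easiest to show in a few lines. The sketch below is one possible layout, assuming whitespace-split words as a stand-in for tokens; the function names and the `child_size` of 20 are hypothetical.

```python
def build_child_index(parents, child_size=20):
    """Cut each parent into small children, remembering which parent each came from."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            child_text = " ".join(words[start:start + child_size])
            children.append((child_text, parent_id))
    return children


def expand_to_parents(matched_children, parents):
    """Swap matched children for their parents, keeping rank order, without duplicates."""
    seen, context = set(), []
    for _child_text, parent_id in matched_children:
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context


# Hypothetical parent chunks: one of 50 words, one of 10 words.
parents = {
    "p1": " ".join(f"alpha{i}" for i in range(50)),
    "p2": " ".join(f"beta{i}" for i in range(10)),
}
children = build_child_index(parents)

# Pretend retrieval matched two children of p1 and one child of p2.
matched = [children[1], children[0], children[3]]
print(len(expand_to_parents(matched, parents)))
```

Only the child texts would be embedded and searched. The parents are looked up by ID afterwards, so two hits from the same parent add that parent to the context once.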
RAG Evaluation
Evaluate your RAG pipeline on three dimensions:
- Retrieval quality: Recall@k (are relevant chunks in top k?), MRR (mean reciprocal rank)
- Generation faithfulness: Is the answer supported by the retrieved context? (can be scored by an LLM judge)
- Answer relevance: Does the answer address the question? (frameworks such as RAGAS and TruLens automate this)
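The two retrieval metrics are simple enough to compute by hand. The following is a minimal sketch, assuming each query comes with a list of retrieved chunk IDs and a set of IDs labeled relevant; the IDs are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the top k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_reciprocal_rank(results):
    """Average of 1 / rank of the first relevant hit; a query with no hit scores 0.

    results: list of (retrieved_ids, relevant_ids) pairs, one per query.
    """
    total = 0.0
    for retrieved, relevant in results:
        relevant = set(relevant)
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


# Hypothetical evaluation set of three queries.
results = [
    (["a", "b", "c"], ["b"]),       # first relevant hit at rank 2
    (["x", "y", "z"], ["x", "q"]),  # first relevant hit at rank 1
    (["m", "n"], ["k"]),            # no relevant hit
]
print(mean_reciprocal_rank(results))               # → 0.5
print(recall_at_k(["x", "y", "z"], ["x", "q"], 3))  # → 0.5
```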
Popular RAG Frameworks
- LangChain: A widely used Python framework for building RAG and LLM chains; large ecosystem of integrations
- LlamaIndex: Specialized for document ingestion and RAG; excellent chunking and indexing abstractions
- Haystack (deepset): Production-focused, supports hybrid search, strong evaluation tooling
- DSPy: Treats the RAG pipeline as a program to optimize; its optimizers tune prompts and few-shot examples against a metric you define
Frequently Asked Questions
What is RAG (Retrieval-Augmented Generation)?
RAG is a technique that grounds an LLM's responses in retrieved documents rather than its static training memory. At query time, relevant document chunks are retrieved from a vector database and injected into the LLM's context. The LLM then generates an answer based on those real sources, which reduces hallucination.
How does RAG reduce hallucination?
RAG supplies the LLM with explicit, retrieved context that is expected to contain the answer. The model is prompted to answer only from that context, turning generation into an extraction and synthesis task rather than a memory recall task. This also makes errors traceable: if the answer is wrong, you can inspect which chunks were retrieved.
What is a vector database?
A vector database stores text chunks together with their high-dimensional embedding vectors and supports fast approximate nearest-neighbor (ANN) search. Popular options: Pinecone (managed); Qdrant, Weaviate, and Milvus (open source); pgvector (Postgres extension). For prototyping, FAISS is a simple in-memory option.
RAG vs fine-tuning: when to use each?
Use RAG when you need up-to-date factual knowledge, source attribution, privacy, or frequent knowledge updates. Use fine-tuning when you need to teach the model a new skill, tone, or domain-specific format that cannot be expressed in a prompt. Combine both in production for best results.
What is chunking in RAG?
Chunking splits source documents into smaller passages before embedding. The ideal chunk is large enough to contain a self-contained answer but small enough to match specific queries accurately. Common sizes: 256–512 tokens with 32–64 token overlap. Semantic chunking (splitting at topic boundaries) can outperform fixed-size splitting.