What is RAG (Retrieval-Augmented Generation)?

Last updated: June 2025 · AI Pentium Editorial Team

Quick Summary

RAG (Retrieval-Augmented Generation) addresses one of the biggest limitations of LLMs: reliance on static training data. A RAG system retrieves relevant document chunks from an external knowledge base at query time and injects them into the LLM's context alongside the user's question. The LLM can then cite real sources and draw on current information without being retrained.

Why LLMs Need RAG

Large language models store knowledge in their billions of parameters — "parametric memory." This creates fundamental limitations:

RAG addresses all four problems by separating knowledge storage (a document store you control) from reasoning (the LLM).

The RAG Pipeline: Step by Step

Step 1: Document Ingestion and Chunking

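Source documents (PDFs, web pages, database records, code files) are loaded and split into smaller chunks. Chunk size is a critical hyperparameter: a chunk should be large enough to contain a self-contained answer but small enough to match specific queries accurately. Typical values are 256–512 tokens per chunk with a 32–64 token overlap between adjacent chunks.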

Advanced chunking strategies: sentence-level chunks (split at sentence boundaries), recursive character splitters (split at paragraphs → sentences → words as fallback), and semantic chunking (use embedding similarity to detect topic shifts and split there).
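A minimal sketch of fixed-size chunking with overlap, the simplest baseline that the strategies above improve on. It assumes whitespace-separated words as a rough stand-in for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the function name `chunk_words` and its defaults are illustrative, not taken from any library.

```python
def chunk_words(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    Assumption: words approximate tokens; swap in a real tokenizer if needed.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

A recursive or semantic splitter would replace the fixed `step` with boundaries chosen from the text itself, at the cost of chunks of uneven length.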

Step 2: Embedding

Each chunk is passed through an embedding model — a neural network that converts text into a dense vector (typically 768–3072 dimensions) that captures semantic meaning. Text with similar meaning has nearby vectors in this high-dimensional space.
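To make "nearby vectors" concrete, here is a small sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made up by hand for illustration; they do not come from any real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hand-written toy "embeddings" (hypothetical values, not model output).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

if __name__ == "__main__":
    print("cat vs kitten: ", round(cosine_similarity(cat, kitten), 3))
    print("cat vs invoice:", round(cosine_similarity(cat, invoice), 3))
```

With these toy values, the two related terms score close to 1 while the unrelated pair scores close to 0, which is the property retrieval relies on.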

Popular embedding models:

| Model | Dimensions | Cost | Best For |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | $0.02/M tokens | General English, production |
| OpenAI text-embedding-3-large | 3072 | $0.13/M tokens | High-accuracy retrieval |
| BAAI/bge-m3 (open source) | 1024 | Free | Multilingual, self-hosted |
| nomic-embed-text (open source) | 768 | Free | English, competitive with OpenAI small |
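As a rough worked example of what the per-token prices mean for a one-time indexing run, the sketch below multiplies a corpus size by the listed rates. The corpus figures (10,000 chunks of 400 tokens) are hypothetical, the prices are the ones quoted in this article and may change, and chunk overlap would increase the real token count.

```python
# Prices in USD per million tokens, as quoted in this article (may be outdated).
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost_usd(total_tokens: int, model: str) -> float:
    """Estimate the cost of embedding `total_tokens` tokens once."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


if __name__ == "__main__":
    # Hypothetical corpus: 10,000 chunks of 400 tokens each = 4,000,000 tokens.
    total_tokens = 10_000 * 400
    for model in PRICE_PER_MILLION_TOKENS:
        print(f"{model}: ${embedding_cost_usd(total_tokens, model):.2f}")
```

Under these assumptions the small model costs about $0.08 and the large model about $0.52 to embed the whole corpus; the self-hosted models carry no per-token fee but do require your own compute.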

Step 3: Vector Database Storage

Chunk embeddings are stored in a vector database that supports fast approximate nearest-neighbor (ANN) search. Each entry stores: the raw chunk text, its embedding vector, and metadata (source document, page number, timestamp).
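The shape of one stored entry can be sketched as a small record type. This is an illustrative in-memory stand-in, not the schema or API of any particular vector database; the names `ChunkRecord` and `InMemoryStore` and the metadata keys are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    """One entry: raw text, its embedding, and metadata about its origin."""
    text: str
    embedding: list[float]
    metadata: dict[str, object] = field(default_factory=dict)


class InMemoryStore:
    """Toy store that only checks dimensions and keeps records in a list."""

    def __init__(self, dimensions: int) -> None:
        self.dimensions = dimensions
        self.records: list[ChunkRecord] = []

    def add(self, record: ChunkRecord) -> None:
        # Every vector in an index must have the same dimensionality.
        if len(record.embedding) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(record.embedding)}"
            )
        self.records.append(record)

    def filter_by_source(self, source: str) -> list[ChunkRecord]:
        # Metadata lets a query be restricted to one document before searching.
        return [r for r in self.records if r.metadata.get("source") == source]


if __name__ == "__main__":
    store = InMemoryStore(dimensions=3)
    store.add(ChunkRecord(
        text="Refunds are issued within 14 days.",
        embedding=[0.1, 0.7, 0.2],  # hypothetical vector
        metadata={"source": "policy.pdf", "page": 4, "timestamp": "2025-06-01"},
    ))
    print(len(store.filter_by_source("policy.pdf")))
```

A production database adds an ANN index on top of this layout so that search does not have to scan every record.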

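Popular vector databases include Pinecone (managed), Qdrant, Weaviate, Milvus (open source), and pgvector (Postgres extension). For prototyping, FAISS is a simple in-memory option.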

Step 4: Query-Time Retrieval

When a user submits a query:

  1. The query is embedded using the same model used for document embedding.
  2. The vector database performs an ANN search to find the k most similar chunk embeddings (typically k=3–10).
  3. The retrieved chunks are ranked by similarity score and optionally re-ranked by a cross-encoder reranker for improved precision.
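The three retrieval steps above can be sketched as follows. Two simplifications are assumed: the search is exact (it scores every chunk) rather than approximate, and the reranker is any caller-supplied scoring function standing in for a cross-encoder. The function names and the toy two-dimensional vectors are hypothetical.

```python
import heapq
import math
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 3,
    rerank: Optional[Callable[[str], float]] = None,
) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query as (score, text) pairs.

    `index` holds (chunk_text, embedding) pairs. If `rerank` is given, the
    top-k candidates are reordered by its score (a cross-encoder stand-in).
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    if rerank is not None:
        top = sorted(top, key=lambda pair: rerank(pair[1]), reverse=True)
    return top


if __name__ == "__main__":
    toy_index = [
        ("refund policy", [1.0, 0.0]),
        ("shipping times", [0.0, 1.0]),
        ("return window", [0.9, 0.1]),
    ]
    for score, text in retrieve([1.0, 0.0], toy_index, k=2):
        print(f"{score:.3f}  {text}")
```

The query vector must come from the same embedding model as the indexed vectors; otherwise the similarity scores are not comparable.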

Hybrid search (combining dense semantic search with BM25 keyword matching) generally outperforms either alone, especially for queries containing specific technical terms or proper nouns.
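The article does not say how the dense and BM25 results are combined. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method intended here. It needs only the two ranked lists of chunk IDs, not their raw scores; the constant `k=60` is the conventional default for RRF, and the IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of IDs into one list.

    Each ID earns 1 / (k + rank) from every list it appears in (rank starts
    at 1), so items ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    dense_ranking = ["a", "b", "c"]   # hypothetical semantic-search order
    bm25_ranking = ["b", "d", "a"]    # hypothetical keyword-search order
    print(reciprocal_rank_fusion([dense_ranking, bm25_ranking]))
```

Because RRF works on ranks, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.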

Step 5: Context Injection and LLM Generation

The top-k retrieved chunks are inserted into the LLM's prompt alongside the user query, typically in a template like:

You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say you don't know.

Context:
[Chunk 1 text]
[Chunk 2 text]
[Chunk 3 text]

Question: {user_query}

The LLM generates an answer grounded in the retrieved context. Optionally, it cites the source documents. If no relevant chunks are found (similarity below threshold), the system can fall back to the model's parametric knowledge or return "I don't know."
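Filling the template and applying the similarity threshold might look like the sketch below. The template text is the one shown above; the function name `build_prompt`, the `min_score` value of 0.5, and the choice to return `None` when nothing passes the threshold are assumptions for illustration. The caller decides what the fallback actually does.

```python
from typing import Optional

PROMPT_TEMPLATE = (
    "You are a helpful assistant. Answer the question using ONLY the provided context.\n"
    "If the answer is not in the context, say you don't know.\n"
    "\n"
    "Context:\n"
    "{context}\n"
    "\n"
    "Question: {user_query}"
)


def build_prompt(
    user_query: str,
    scored_chunks: list[tuple[float, str]],
    min_score: float = 0.5,  # hypothetical threshold; tune on your own data
) -> Optional[str]:
    """Build the LLM prompt from (similarity, text) pairs.

    Returns None when no chunk reaches `min_score`, signalling the caller to
    fall back (for example, to reply "I don't know").
    """
    relevant = [text for score, text in scored_chunks if score >= min_score]
    if not relevant:
        return None
    return PROMPT_TEMPLATE.format(context="\n".join(relevant), user_query=user_query)


if __name__ == "__main__":
    prompt = build_prompt(
        "How long do refunds take?",
        [(0.82, "Refunds are issued within 14 days."), (0.21, "We ship worldwide.")],
    )
    print(prompt if prompt is not None else "I don't know.")
```

Returning `None` keeps the fallback decision out of the prompt builder, so the same function works whether the system answers from parametric knowledge or declines.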

RAG vs Fine-Tuning

| Dimension | RAG | Fine-Tuning |
| --- | --- | --- |
| Knowledge updates | Add documents any time | Requires retraining |
| Hallucination | Reduced (grounded in sources) | Still present |
| Cost | Low (no GPU training) | High (GPU-hours to days) |
| Source attribution | Easy (cite retrieved chunks) | Difficult |
| Teach new skills/style | No | Yes |
| Privacy | Data stays in your store | Data used in training |

Most production systems combine both: a fine-tuned model for tone, format, and domain-specific skills, plus RAG for factual grounding and up-to-date knowledge.

Advanced RAG Patterns

RAG Evaluation

Evaluate your RAG pipeline on three dimensions:

Popular RAG Frameworks

Frequently Asked Questions

What is RAG (Retrieval-Augmented Generation)?

RAG is a technique that grounds an LLM's responses in retrieved documents rather than its static training memory. At query time, relevant document chunks are retrieved from a vector database and injected into the LLM's context. The LLM then generates an answer based on those sources, which reduces hallucination.

How does RAG reduce hallucination?

RAG provides the LLM with explicit, retrieved context that should contain the answer. The model is prompted to answer only from that context, which turns generation into an extraction and synthesis task rather than a memory-recall task. It also makes errors traceable: if the answer is wrong, you can inspect which chunks were retrieved.

What is a vector database?

A vector database stores text chunks as high-dimensional embedding vectors and supports fast approximate nearest-neighbor (ANN) search. Popular options: Pinecone (managed), Qdrant, Weaviate, Milvus (open source), pgvector (Postgres extension). For prototyping, FAISS is a simple in-memory option.

RAG vs fine-tuning: when to use each?

Use RAG when you need up-to-date factual knowledge, source attribution, privacy, or frequent knowledge updates. Use fine-tuning when you need to teach the model a new skill, tone, or domain-specific format that cannot be expressed in a prompt. Combine both in production for best results.

What is chunking in RAG?

Chunking splits source documents into smaller passages before embedding. The ideal chunk is large enough to contain a self-contained answer but small enough to match specific queries accurately. Common sizes: 256–512 tokens with 32–64 token overlap. Semantic chunking (splitting at topic boundaries) can outperform fixed-size splitting.
