What is the difference between RAG and fine-tuning?

Fine-tuning updates the LLM's weights to encode new knowledge or behaviors; RAG retrieves knowledge at inference time without changing weights. Fine-tuning is better for teaching the model new skills, styles, or formats. RAG is better for keeping knowledge up-to-date, grounding responses in proprietary documents, and reducing hallucination. RAG is cheaper (no GPU training), more transparent (sources can be cited), and more easily updated. Many production systems combine both.

What embedding model should I use for RAG?

For English text: OpenAI text-embedding-3-small or text-embedding-3-large offer strong quality at low cost. Open-source alternatives: BAAI/bge-m3 and Nomic Embed are competitive with the OpenAI models and carry no per-token API fee (you pay only for the compute you host them on). For multilingual RAG: multilingual-e5-large or mGTE-large. Use the MTEB leaderboard (Hugging Face) to compare embedding model performance on retrieval tasks.

AI Technology Explained

What is RAG (Retrieval-Augmented Generation)?

Q: What is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation (RAG) is a technique that grounds an LLM's responses in retrieved documents rather than relying solely on the model's parametric memory. At query time, the most relevant document chunks are retrieved from a vector database and injected into the LLM's context alongside the user question. The LLM then generates a response grounded in those retrieved passages, which reduces hallucination on knowledge-intensive tasks.

Q: How does RAG reduce hallucination?

RAG reduces hallucination by providing the LLM with explicit, retrieved context at inference time. Instead of relying on facts memorized during training (which may be outdated, missing, or confabulated), the model generates answers from documents that are retrieved and placed in its context. This separates knowledge storage (the document store) from reasoning (the LLM), allowing knowledge to be updated by adding new documents without retraining the model.

Q: What is a vector database?

A vector database stores high-dimensional embedding vectors and supports fast approximate nearest-neighbor (ANN) search. When you embed document chunks and queries into the same vector space, semantically similar content has nearby vectors. Popular vector databases include Pinecone, Weaviate, Qdrant, Milvus, and pgvector (PostgreSQL extension). For small-scale use, FAISS (Facebook AI Similarity Search) is a popular in-memory library.

Q: What is chunking in RAG?

Chunking is the process of splitting source documents into smaller passages before embedding them. Chunk size is a critical hyperparameter: too small (e.g. 50 tokens) and individual chunks lack sufficient context; too large (e.g. 1000+ tokens) and retrieved chunks carry irrelevant text and can exhaust the context budget. Common strategies: fixed-size overlapping chunks (e.g. 256 tokens, 32-token overlap), sentence-level chunking, and semantic chunking (split at topic boundaries detected by an LLM or by a drop in embedding similarity).

Last updated: June 2025 · AI Pentium Editorial Team


Quick Summary

RAG (Retrieval-Augmented Generation) addresses one of LLMs' biggest limitations: reliance on static training data. A RAG system retrieves relevant document chunks from an external knowledge base at query time and injects them into the LLM's context alongside the user's question. The LLM can then cite real sources and stay current without retraining.

Why LLMs Need RAG

Large language models store knowledge in their billions of parameters — "parametric memory." This creates fundamental limitations:

Knowledge cutoff: Training data ends at a fixed date. The original GPT-4 had a September 2021 cutoff (GPT-4 Turbo later extended it into 2023), and Claude 3.5 Sonnet's is April 2024. The model cannot know about newer events, papers, or product releases.
Hallucination: The model generates plausible-sounding text even when it doesn't "know" the answer. It cannot reliably distinguish what it knows from what it's confabulating.
Private data: The model was not trained on your company's internal documents, codebase, or customer data — and you cannot retrain it for each update.
Scalability: Fine-tuning a 70B-parameter model costs thousands of dollars in GPU time and must be repeated for every knowledge update.

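RAG addresses all four problems by separating knowledge storage (a document store you control) from reasoning (the LLM). Cutoff, private data, and update cost are handled by editing the document store; hallucination is reduced rather than eliminated, because the model can still misread or ignore retrieved text.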

The RAG Pipeline: Step by Step

Step 1: Document Ingestion and Chunking

Source documents (PDFs, web pages, database records, code files) are loaded and split into smaller chunks. Chunk size is a critical hyperparameter. Typical values:

256–512 tokens per chunk: good balance of specificity and context for most text
32–64 token overlap between chunks: preserves continuity across chunk boundaries

Advanced chunking strategies: sentence-level chunks (split at sentence boundaries), recursive character splitters (split at paragraphs → sentences → words as fallback), and semantic chunking (use embedding similarity to detect topic shifts and split there).
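The fixed-size strategy with overlap can be sketched in a few lines. This is a minimal illustration, not a production splitter: the function name `chunk_tokens` is hypothetical, and the "tokens" are plain list items standing in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Stand-in tokens (assumption): a real pipeline would use the embedding
# model's own tokenizer so that sizes match the model's limits.
tokens = [f"t{i}" for i in range(600)]
chunks = chunk_tokens(tokens, chunk_size=256, overlap=32)
print([len(c) for c in chunks])
```

Each chunk begins with the last 32 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk.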

Step 2: Embedding

Each chunk is passed through an embedding model — a neural network that converts text into a dense vector (typically 768–3072 dimensions) that captures semantic meaning. Text with similar meaning has nearby vectors in this high-dimensional space.

Popular embedding models:

Model	Dimensions	Cost	Best For
OpenAI text-embedding-3-small	1536	$0.02/M tokens	General English, production
OpenAI text-embedding-3-large	3072	$0.13/M tokens	High-accuracy retrieval
BAAI/bge-m3 (open source)	1024	Free (self-hosted compute)	Multilingual, self-hosted
nomic-embed-text (open source)	768	Free (self-hosted compute)	English, competitive with OpenAI small
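"Nearby vectors" is usually measured with cosine similarity. The sketch below uses hand-written 4-dimensional vectors purely for illustration; they are not the output of any model in the table, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: invented values, not model output).
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.8, 0.2, 0.1, 0.3]
invoice = [0.0, 0.9, 0.8, 0.1]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```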

Step 3: Vector Database Storage

Chunk embeddings are stored in a vector database that supports fast approximate nearest-neighbor (ANN) search. Each entry stores: the raw chunk text, its embedding vector, and metadata (source document, page number, timestamp).

Popular vector databases:

Pinecone: Managed cloud service, simplest to get started, strong ANN performance
Weaviate: Open-source, supports hybrid search (dense + BM25 keyword)
Qdrant: Rust-based open-source, fast, good filtering support
Milvus: Open-source, highly scalable, used at enterprise scale
pgvector: PostgreSQL extension — zero new infrastructure if you already use Postgres
FAISS: In-memory library (not a full database), ideal for prototyping
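To make the stored record concrete, here is a toy in-memory store holding text, vector, and metadata per entry. It is a sketch under stated assumptions: `TinyVectorStore` and `Entry` are hypothetical names, and the search is an exact brute-force scan, whereas the databases above use ANN indexes to avoid comparing against every entry.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entry:
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)


class TinyVectorStore:
    """Exact (brute-force) nearest-neighbor search over unit-normalized vectors."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    def add(self, text, vector, **metadata):
        self.entries.append(Entry(text, self._normalize(vector), metadata))

    def search(self, query_vector, k=3):
        q = self._normalize(query_vector)
        # On unit vectors the dot product equals cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, e.vector)), e) for e in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add("Refunds are accepted within 30 days.", [0.9, 0.1, 0.0], source="policy.pdf", page=2)
store.add("Shipping takes 5 business days.", [0.1, 0.9, 0.0], source="faq.html", page=1)
store.add("Refunds go to the original card.", [0.8, 0.0, 0.2], source="policy.pdf", page=3)

results = store.search([1.0, 0.0, 0.0], k=2)
for score, entry in results:
    print(round(score, 3), entry.text, entry.metadata)
```

Keeping the metadata next to the vector is what later makes source citation and filtering (for example, by document or date) possible.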

Step 4: Query-Time Retrieval

When a user submits a query:

The query is embedded using the same model used for document embedding.
The vector database performs an ANN search to find the k most similar chunk embeddings (typically k=3–10).
The retrieved chunks are ranked by similarity score and optionally re-ranked by a cross-encoder reranker for improved precision.

Hybrid search (combining dense semantic search with BM25 keyword matching) generally outperforms either alone, especially for queries containing specific technical terms or proper nouns.
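Hybrid search needs a way to merge the dense and keyword result lists. One common choice, which the text above does not prescribe, is reciprocal rank fusion (RRF): each chunk scores 1 / (rrf_k + rank) in every list it appears in, and the scores are summed. The chunk IDs below are hypothetical, and `rrf_k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Merge several best-first ID lists into one, using summed reciprocal ranks."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists (assumption): best match first in each.
dense_hits = ["c3", "c1", "c7"]    # from vector similarity search
keyword_hits = ["c1", "c9", "c3"]  # from BM25 keyword search

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.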

Step 5: Context Injection and LLM Generation

The top-k retrieved chunks are inserted into the LLM's prompt alongside the user query, typically in a template like:

You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say you don't know.

Context:
[Chunk 1 text]
[Chunk 2 text]
[Chunk 3 text]

Question: {user_query}

The LLM generates an answer grounded in the retrieved context. Optionally, it cites the source documents. If no relevant chunks are found (similarity below threshold), the system can fall back to the model's parametric knowledge or return "I don't know."
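Prompt assembly with a similarity threshold might look like the following. This is a sketch: `build_prompt` is a hypothetical helper, the `min_score` of 0.3 is an arbitrary placeholder that would need tuning per embedding model, and the numbered `[1]`, `[2]` labels are one possible way to let the model cite chunks.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {user_query}"""


def build_prompt(user_query, scored_chunks, min_score=0.3, top_k=3):
    """Build the prompt from (similarity, text) pairs.

    Returns None when no chunk clears `min_score`, so the caller can decide
    whether to fall back to the bare model or answer "I don't know".
    """
    relevant = [pair for pair in scored_chunks if pair[0] >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    if not relevant:
        return None
    context = "\n".join(
        f"[{i}] {text}" for i, (_, text) in enumerate(relevant[:top_k], start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, user_query=user_query)


prompt = build_prompt(
    "What is the refund window?",
    [(0.82, "Refunds are accepted within 30 days."), (0.10, "Our office is in Berlin.")],
)
print(prompt)
```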

RAG vs Fine-Tuning

Dimension	RAG	Fine-Tuning
Knowledge updates	Add documents any time	Requires retraining
Hallucination	Reduced (grounded in sources)	Still present
Cost	Low (no GPU training)	High (GPU-hours to days)
Source attribution	Easy (cite retrieved chunks)	Difficult
Teach new skills/style	No	Yes
Privacy	Documents stay in your store; only retrieved chunks reach the LLM	Data used in training

Most production systems combine both: fine-tuned models for tone, format, and domain-specific skills + RAG for factual grounding and up-to-date knowledge.

Advanced RAG Patterns

HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer to the query, embed that, and use it for retrieval. Works better for complex questions where the raw query is too short, or phrased too differently, to embed close to the documents that answer it.
Multi-query retrieval: Generate multiple paraphrases of the query with an LLM, retrieve for all of them, deduplicate results. Increases recall for ambiguous queries.
Parent-child chunking: Store small child chunks for precise retrieval but return their larger parent chunk as context, balancing retrieval precision with generation context richness.
Graph RAG: Build a knowledge graph from documents; retrieve based on entity relationships in addition to semantic similarity. Better for multi-hop reasoning.
Reranking: After initial k-NN retrieval, apply a cross-encoder reranker (e.g. Cohere Rerank, bge-reranker) to reorder chunks by relevance to the specific query.
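Of the patterns above, parent-child chunking is the easiest to show without a model. The sketch assumes retrieval has already returned child chunk IDs in best-first order; the IDs, the two lookup dictionaries, and the function name `expand_to_parents` are all hypothetical.

```python
def expand_to_parents(child_hits, child_to_parent, parents):
    """Map retrieved child IDs to their parent texts, keeping order and dropping repeats."""
    seen = set()
    context = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # two children of one parent add it only once
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context


# Hypothetical index (assumption): small chunks are embedded, large ones are returned.
parents = {
    "p1": "Full refund policy section ...",
    "p2": "Full shipping section ...",
}
child_to_parent = {"p1-a": "p1", "p1-b": "p1", "p2-a": "p2"}

print(expand_to_parents(["p1-b", "p2-a", "p1-a"], child_to_parent, parents))
```

Deduplicating parents matters because several high-scoring children often come from the same section, and repeating it would waste context budget.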

RAG Evaluation

Evaluate your RAG pipeline on three dimensions:

Retrieval quality: Recall@k (are relevant chunks in top k?), MRR (mean reciprocal rank)
Generation faithfulness: Is the answer supported by the retrieved context? (can be scored by an LLM judge)
Answer relevance: Does the answer address the question? (RAGAS, TruLens frameworks automate this)
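The two retrieval metrics are simple enough to compute by hand. The sketch below assumes you have, for each test query, the retrieved chunk IDs in rank order and a hand-labeled set of relevant IDs; the IDs and function names are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the top k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant hit per query (0 if none is found)."""
    total = 0.0
    for retrieved, relevant in runs:
        relevant = set(relevant)
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Hypothetical labeled runs (assumption): (retrieved in rank order, relevant set).
runs = [
    (["c1", "c4", "c2"], ["c2", "c9"]),  # first relevant hit at rank 3
    (["c5", "c6"], ["c5"]),              # first relevant hit at rank 1
]

print(recall_at_k(*runs[0], k=3))   # 0.5
print(mean_reciprocal_rank(runs))   # (1/3 + 1) / 2
```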

Popular RAG Frameworks

LangChain: One of the most widely used Python frameworks for building RAG and LLM chains; large ecosystem of integrations
LlamaIndex: Specialized for document ingestion and RAG; excellent chunking and indexing abstractions
Haystack (deepset): Production-focused, supports hybrid search, strong evaluation tooling
DSPy: Treats RAG as an optimization problem; automatically tunes prompts and few-shot examples against a metric you define

Frequently Asked Questions

What is RAG (Retrieval-Augmented Generation)?

RAG is a technique that grounds an LLM's responses in retrieved documents rather than its static training memory. At query time, relevant document chunks are retrieved from a vector database and injected into the LLM's context. The LLM then generates an answer based on those sources, which reduces hallucination.

How does RAG reduce hallucination?

By providing the LLM with explicit, retrieved context that, when retrieval succeeds, contains the answer. The model is prompted to answer only from the provided context, turning generation into an extraction/synthesis task rather than a memory recall task. This makes errors traceable: if the answer is wrong, you can see which chunks were retrieved.

What is a vector database?

A vector database stores text chunks as high-dimensional embedding vectors and supports fast approximate nearest-neighbor (ANN) search. Popular options: Pinecone (managed), Qdrant, Weaviate, Milvus (open source), pgvector (Postgres extension). For prototyping, FAISS is a simple in-memory option.

RAG vs fine-tuning: when to use each?

Use RAG when you need up-to-date factual knowledge, source attribution, privacy, or frequent knowledge updates. Use fine-tuning when you need to teach the model a new skill, tone, or domain-specific format that cannot be expressed in a prompt. Combine both in production for best results.

What is chunking in RAG?

Chunking splits source documents into smaller passages before embedding. The ideal chunk is large enough to contain a self-contained answer but small enough to match specific queries accurately. Common sizes: 256–512 tokens with 32–64 token overlap. Semantic chunking (splitting at topic boundaries) can outperform fixed-size splitting.

Read the latest RAG and LLM research

AI Pentium indexes new papers on retrieval-augmented generation, vector databases, and LLM grounding from arXiv daily.

