Retrieval-Augmented Generation: how AI looks things up before it answers -- and why that changes everything. Bite-size intro + deep dive.
RAG is a technique that makes AI smarter by letting it look things up before it answers. Instead of relying only on what it learned during training, a RAG system retrieves relevant documents from a knowledge base, stuffs them into the prompt, and then generates a response grounded in real, up-to-date information.
Introduced by Facebook AI Research (now Meta AI) in 2020.
RAG is like giving the robot an open-book exam. Instead of relying on memory alone, it gets to flip through its notes before answering each question.
Without RAG, the robot is a student taking a test from memory -- and when it doesn't know the answer, it guesses with total confidence. With RAG, the teacher hands it a stack of relevant pages and says "use these." The answer it writes is way more accurate, because it's working from real material.
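The open-book flow can be sketched end to end in a few lines. This is a minimal, illustrative sketch, not any particular library's API: the retriever is a toy word-overlap scorer standing in for real embedding search, and names like `retrieve` and `build_prompt` are invented for the example.

```python
# Minimal RAG sketch: retrieve relevant notes, stuff them into the
# prompt, then hand the prompt to a language model for generation.
# The retriever here is a toy keyword scorer; a real system would
# use embeddings and a vector database (covered below).

KNOWLEDGE_BASE = [
    "RAG was introduced by Facebook AI Research in 2020.",
    "Embeddings turn text into vectors that capture meaning.",
    "Vector databases make similarity search fast.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy stand-in)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Stuff the retrieved documents into the prompt as context."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Use these notes:\n{context}\n\nQuestion: {query}"

query = "Who introduced RAG?"
prompt = build_prompt(query, retrieve(query))
# `prompt` now grounds the model's answer in retrieved text --
# the "stack of relevant pages" from the open-book analogy.
```

In a real pipeline the last step passes `prompt` to an LLM; everything before that is just careful context assembly.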
Watch a question flow through the full RAG pipeline. Hit play or click any step.
Embeddings are the secret sauce that makes RAG possible. Here's the idea.
Vector databases store your embeddings and make similarity search blazingly fast. These are the ones that matter.
Fully managed vector database. Zero infrastructure headaches. Popular with startups and enterprises shipping fast.
Open-source vector search engine with built-in vectorization modules. Self-host or use their cloud.
Lightweight, runs locally, perfect for prototyping and small projects. The SQLite of vector databases.
Postgres extension that adds vector similarity search. Use it if you already run Postgres -- no new infra needed.
The technical concepts powering RAG under the hood.
Text gets converted into numerical vectors -- long lists of numbers that capture meaning. Similar concepts end up close together in vector space, even if they use different words.
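"Close together in vector space" has a precise meaning: cosine similarity. Here is a toy illustration with hand-made 3-dimensional vectors; the numbers are invented for the example, and real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-d "embeddings" -- real models output far more dimensions.
puppy = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]          # different word, similar meaning
spreadsheet = [0.1, 0.2, 0.9]  # unrelated concept

sim_close = cosine_similarity(puppy, dog)         # high: near in meaning
sim_far = cosine_similarity(puppy, spreadsheet)   # low: far in meaning
```

This is why RAG can match "puppy" to a document about dogs even though the words differ: similarity lives in the geometry, not the spelling.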
A specialized database that stores embeddings and lets you search by similarity instead of exact keyword match. Think of it as a library organized by meaning, not alphabetical order.
When you ask a question, your query gets embedded too. The system finds the stored documents whose vectors are closest to your query vector -- the most semantically relevant matches.
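That lookup is a nearest-neighbor search. A minimal sketch, with invented vectors and documents standing in for a real vector store (a production system would embed the query with the same model used for the documents):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Invented (vector, text) pairs standing in for a vector database.
store = [
    ([0.9, 0.1, 0.0], "How to reset your password"),
    ([0.8, 0.2, 0.1], "Account recovery steps"),
    ([0.0, 0.1, 0.9], "Quarterly revenue report"),
]

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are closest to the query."""
    ranked = sorted(store, key=lambda pair: cosine(query_vec, pair[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

# A login-trouble query embeds near the first two documents.
results = top_k([0.85, 0.15, 0.05])
```

Real vector databases do the same ranking, but with approximate-nearest-neighbor indexes so it stays fast across millions of vectors instead of three.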
Long documents get split into smaller pieces (chunks) before embedding. Chunk size matters: too big and you lose precision, too small and you lose context. Overlapping chunks help preserve continuity.
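A minimal chunker makes the overlap idea concrete. This sketch splits by character count for simplicity; real systems usually chunk by tokens, sentences, or document structure, and the sizes here are chosen just to keep the example readable.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlapping the boundaries preserves continuity: a sentence cut
    in half by one chunk still appears whole in its neighbor.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "RAG systems split long documents into chunks before embedding them."
chunks = chunk_text(doc, chunk_size=30, overlap=8)
# Each chunk shares its last 8 characters with the start of the next.
```

Tuning `chunk_size` and `overlap` is the too-big/too-small trade-off from above: bigger chunks keep context but dilute precision, smaller ones sharpen retrieval but can strand a sentence from its meaning.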
Knows your docs inside out
Cites its sources
Internal knowledge on demand
Finds the clause that matters
Two different approaches to making an LLM smarter. They solve different problems.
RAG
- How it works: retrieves external documents at query time
- Cost: cheaper -- no GPU training required
- Updating: instant -- just update the knowledge base
- Best for: factual Q&A, documentation, search over data
- Trade-offs: depends on retrieval quality, adds latency

Fine-tuning
- How it works: retrains the model on new data (changes weights)
- Cost: expensive -- requires GPU compute for training
- Updating: slow -- retrain the whole model for new data
- Best for: style, tone, specialized behavior, niche domains
- Trade-offs: risk of catastrophic forgetting, training overhead
Use RAG when you need the model to know specific facts or access up-to-date information. Use fine-tuning when you need the model to behave differently -- write in a specific style, follow a specialized workflow, or handle a niche domain. Many production systems use both.