Why RAG Exists
Your company has thousands of internal documents. Your LLM knows nothing about them - it was trained on public internet data, not your product specs, support tickets, or policy docs.
You have two options:
- Fine-tune the model on your data. Expensive, slow, requires ML expertise, and goes stale the moment a document changes.
- RAG - Retrieval-Augmented Generation. Store your documents in a vector database, retrieve the most relevant ones at query time, and inject them into the prompt. Fast, cheap, and current as long as you re-index documents when they change.
RAG is what makes most enterprise AI applications practical. It turns a general-purpose LLM into one that can answer from your business's own documents.
The Full RAG Pipeline
There are two phases: indexing (done once, or incrementally) and querying (done at runtime for every user request).
RAG Pipeline: Indexing and Query Phases
flowchart TD
    D[Your Documents] --> C[Chunker Split into segments]
    C --> E[Embedding Model Text → vector]
    E --> VS[(Vector Store ChromaDB)]
    Q[User Query] --> QE[Embed Query]
    QE --> RET[Similarity Search]
    VS --> RET
    RET --> CTX[Context Assembly]
    CTX --> LLM[LLM Generate answer]
    LLM --> ANS[Answer + Sources]
    style D fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
    style VS fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
    style LLM fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
    style ANS fill:#dcfce7,stroke:#16a34a,color:#15803d
    style Q fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
Indexing phase (happens once or on document updates):
- Chunk - Split documents into segments (typically 256-512 tokens each)
- Embed - Convert each chunk to a vector using an embedding model
- Store - Save vectors + original text in a vector database
Query phase (happens on every user request):
- Embed the query - Use the same embedding model to convert the question to a vector
- Similarity search - Find the top-K chunks whose vectors are closest to the query vector
- Assemble context - Build a prompt that includes the retrieved chunks
- Generate - Send the augmented prompt to the LLM and return the answer
What Is a Vector and Why Does It Work?
An embedding model converts text into a list of numbers (a vector) - typically 1,536 numbers for OpenAI’s text-embedding-3-small. Similar text produces similar vectors. “dog” and “puppy” end up close together in vector space. “dog” and “quarterly revenue” end up far apart.
Similarity search finds the chunks whose vectors are closest to the query vector, measured by cosine similarity. This is why RAG works even when the user’s question uses different words than the document - the concepts align even when the vocabulary doesn’t.
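To make the "closeness" idea concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-picked for illustration and are an assumption of this example; real embeddings come from the embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", chosen by hand so related texts point
# in similar directions. They are not output from any real model.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "quarterly revenue": [0.1, 0.05, 0.95],
}

query = toy_vectors["dog"]
for text in ("puppy", "quarterly revenue"):
    score = cosine_similarity(query, toy_vectors[text])
    print(f"dog vs {text}: {score:.3f}")
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are unrelated. A vector store runs this kind of comparison (or an approximation of it) against every stored chunk and returns the highest-scoring ones.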
Full-text search matches exact words. A query for “dog” won’t find a document that says “canine.” Vector similarity search is semantic - it matches meaning, not keywords. For question-answering over documents, where users rarely phrase a question the way the document does, semantic search often surfaces relevant chunks that keyword search misses.
Chunking: The Most Overlooked Step
Chunk size is the single most impactful parameter in your RAG system. Too small, and each chunk lacks context - the LLM gets fragments. Too large, and you hit context window limits and dilute relevance.
Practical starting point: 512 tokens per chunk with a 50-token overlap between consecutive chunks. The overlap prevents a sentence from being split in a way that loses its meaning at the boundary.
[chunk 1: tokens 0-511]
[chunk 2: tokens 462-973] ← 50-token overlap with chunk 1
[chunk 3: tokens 924-1435] ← 50-token overlap with chunk 2
The overlap means that a sentence landing at a chunk boundary still appears in full in one of the two neighboring chunks, provided the sentence is no longer than the overlap.
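The windowing above can be sketched as a small sliding-window function. This is a minimal sketch that assumes the text has already been tokenized; the function name `chunk_tokens` and the integer stand-in tokens are illustrative, not part of any library.

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, overlap: int = 50) -> list[list]:
    """Split a token sequence into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Stand-in for real tokenizer output: the integers 0..1435 act as token positions.
tokens = list(range(1436))
for i, chunk in enumerate(chunk_tokens(tokens), start=1):
    print(f"chunk {i}: tokens {chunk[0]}-{chunk[-1]}")
```

With 1,436 tokens this prints the same three ranges shown above (0-511, 462-973, 924-1435). In a real pipeline you would decode each window back to text before embedding it.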
Build It: ChromaDB + OpenAI RAG
This example embeds three documents about different topics, then queries the collection to demonstrate retrieval and generation.
Minimal RAG Pipeline with ChromaDB and OpenAI
Example code (static). Copy and run locally in your own environment.
import os

import chromadb
from openai import OpenAI

# Initialize clients
oai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
chroma = chromadb.Client()
# Chroma defaults to squared L2 distance; request cosine so the returned
# distances match the cosine-similarity discussion in this tutorial.
collection = chroma.create_collection("docs", metadata={"hnsw:space": "cosine"})

# --- INDEXING PHASE ---
documents = [
    "The company refund policy allows returns within 30 days of purchase with a receipt.",
    "Our data retention policy requires all customer records to be deleted after 7 years.",
    "The on-call rotation for the infrastructure team runs Monday to Sunday, 2-week cycles.",
]
doc_ids = ["doc-1", "doc-2", "doc-3"]


def embed(texts: list[str]) -> list[list[float]]:
    """Return one embedding vector per input text."""
    resp = oai.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [r.embedding for r in resp.data]


# Embed and store all documents
vectors = embed(documents)
collection.add(
    ids=doc_ids,
    embeddings=vectors,
    documents=documents,
)
print(f"Indexed {len(documents)} documents.")

# --- QUERY PHASE ---
query = "How long do I have to return something I bought?"

# Embed the query using the same model
query_vector = embed([query])[0]

# Retrieve top-2 most similar chunks
results = collection.query(
    query_embeddings=[query_vector],
    n_results=2,
)
retrieved_chunks = results["documents"][0]
context = "\n\n".join(retrieved_chunks)
print(f"\nRetrieved chunks:\n{context}")

# Augment the prompt with retrieved context
augmented_prompt = f"""Use only the context below to answer the question.

Context:
{context}

Question: {query}

Answer:"""

# Generate the answer
response = oai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You answer questions using only the provided context."},
        {"role": "user", "content": augmented_prompt},
    ],
    max_tokens=200,
    temperature=0,
)
print(f"\nAnswer: {response.choices[0].message.content}")
Run this and you’ll see the refund policy chunk ranked first for the query about returns. Because n_results=2, one less relevant chunk (data retention or on-call) is returned as well; the LLM should answer from the refund policy text and ignore the other chunk.
Understanding Similarity Scores
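ChromaDB returns a distance score alongside each retrieved chunk (lower = more similar). Use it to filter out low-quality retrievals. The threshold below assumes the collection is configured for cosine distance; ChromaDB’s default metric is squared L2, which is on a different scale. If the closest chunk has a cosine distance above ~0.5, you’re probably retrieving noise - consider returning “I don’t have information about this” rather than hallucinating an answer from irrelevant context. Treat 0.5 as a starting point and tune it on your own data.
distances = results["distances"][0]
MAX_DISTANCE = 0.5  # cosine distance; lower = more similar

# Keep only the chunks that are close enough to the query
good_chunks = [
    doc for doc, dist in zip(retrieved_chunks, distances)
    if dist < MAX_DISTANCE
]
if not good_chunks:
    # Poor retrieval - decline rather than answer from noise
    print("I don't have information about this.")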
Chunk size is your most important parameter. Start with 512 tokens and 50-token overlap. If your retrieval recall is poor (right answers aren’t showing up in top-K), try smaller chunks (256 tokens). If context is too fragmented, try larger chunks (768 tokens). Measure with a retrieval eval before moving to end-to-end evals - you can’t fix a generation problem that’s actually a retrieval problem.
Test retrieval separately from generation. Phase 1: can your system retrieve the right chunks given a known question? Build a retrieval eval with 20-30 query/expected-chunk pairs and measure recall@3 (does the right chunk appear in the top 3?). Only after retrieval recall is above 80% should you build end-to-end generation evals. Mixing the two makes failures impossible to diagnose.
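A retrieval eval of this kind needs very little code. The sketch below computes recall@k from a labeled set of query/expected-chunk pairs; the function name `recall_at_k`, the questions, and the ranked results are all hypothetical stand-ins for your own eval set and retriever output.

```python
def recall_at_k(retrieved: dict[str, list[str]], expected: dict[str, str], k: int = 3) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k retrieved IDs."""
    if not expected:
        raise ValueError("need at least one labeled query")
    hits = sum(
        1 for query, chunk_id in expected.items()
        if chunk_id in retrieved.get(query, [])[:k]
    )
    return hits / len(expected)


# Hypothetical eval set: each question is labeled with the chunk that answers it.
expected = {
    "How long do I have to return an item?": "doc-1",
    "When are customer records deleted?": "doc-2",
    "How long is an on-call shift?": "doc-3",
}

# Hypothetical retriever output: ranked chunk IDs for each question.
retrieved = {
    "How long do I have to return an item?": ["doc-1", "doc-2", "doc-3"],
    "When are customer records deleted?": ["doc-2", "doc-1", "doc-3"],
    "How long is an on-call shift?": ["doc-1", "doc-2", "doc-3"],
}

score = recall_at_k(retrieved, expected, k=2)
print(f"recall@2 = {score:.2f}")
```

In this toy data the third question's expected chunk is ranked third, so it counts as a miss at k=2 and a hit at k=3. Listing the missed queries is usually the fastest way to see whether chunking or the embedding step is at fault.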
What You’ve Built vs. What Production Needs
This example uses an in-memory ChromaDB collection (data is lost on restart). A production RAG system adds:
- Persistent vector store - ChromaDB with disk persistence, Pinecone, Weaviate, or pgvector
- Document ingestion pipeline - Watch for new/updated documents and re-index incrementally
- Metadata filtering - Filter by department, date, or access level before semantic search
- Re-ranking - A second-pass model (cross-encoder) to re-sort the top-K results by relevance
- Citation tracking - Return source document IDs alongside the answer so users can verify
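Two of these additions, metadata filtering and citation tracking, can be sketched in plain Python. This is an illustration under stated assumptions: the `Chunk` class, the `department` field, the distances, and the "doc-4" text are all hypothetical, and in a real system the filter would normally be passed to the vector store's query call rather than applied in application code.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    department: str
    distance: float  # pretend this came from the vector store; lower = closer


def select_chunks(candidates: list[Chunk], department: str, top_k: int = 2) -> list[Chunk]:
    """Apply the metadata filter first, then keep the top_k closest chunks."""
    allowed = [c for c in candidates if c.department == department]
    return sorted(allowed, key=lambda c: c.distance)[:top_k]


def build_context(chunks: list[Chunk]) -> tuple[str, list[str]]:
    """Label each chunk with its source ID and return the IDs for citation."""
    context = "\n\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)
    return context, [c.doc_id for c in chunks]


# Hypothetical candidates with made-up distances and departments.
candidates = [
    Chunk("doc-1", "Returns are accepted within 30 days with a receipt.", "support", 0.21),
    Chunk("doc-2", "Customer records are deleted after 7 years.", "legal", 0.35),
    Chunk("doc-4", "Refunds are issued to the original payment method.", "support", 0.40),
]

context, sources = build_context(select_chunks(candidates, department="support"))
print(context)
print("Sources:", sources)
```

Prefixing each chunk with its ID lets you instruct the LLM to cite IDs in its answer, and returning the ID list separately means the application can show sources even if the model omits them.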
Treat the embedding model and the generation model as separate components. Use a dedicated embedding model (text-embedding-3-small) and a separate generation model (gpt-4o-mini), and track their versions independently. The two upgrade very differently: you can swap the generation model without touching the index, but vectors produced by different embedding models are not comparable, so changing the embedding model means re-embedding every stored chunk. Skip that step and retrieval breaks silently, because queries are compared against vectors from the old model.
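One way to make an embedding-model change fail loudly instead of silently is to record the model name with the index and check it on every write and query. The `VectorIndex` class below is a hypothetical in-memory stand-in, not a ChromaDB API; with a real vector store you might keep the model name in collection metadata instead.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which embedding model produced its vectors."""
    embedding_model: str
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def _check_model(self, model: str) -> None:
        # Vectors from different embedding models are not comparable.
        if model != self.embedding_model:
            raise ValueError(
                f"index was built with {self.embedding_model!r}, got {model!r}; "
                "re-embed all chunks before switching models"
            )

    def add(self, chunk_id: str, vector: list[float], model: str) -> None:
        """Store a vector, rejecting it if it came from a different model."""
        self._check_model(model)
        self.vectors[chunk_id] = vector

    def check_query(self, model: str) -> None:
        """Call before searching with a query vector produced by `model`."""
        self._check_model(model)


index = VectorIndex(embedding_model="text-embedding-3-small")
index.add("doc-1", [0.1, 0.2, 0.3], model="text-embedding-3-small")

try:
    index.check_query("text-embedding-3-large")
except ValueError as err:
    print(f"Rejected query: {err}")
```

The generation model never appears in this check, which is the point: it can change freely, while an embedding-model change forces an explicit re-index.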
What’s Next
In the next tutorial you’ll build an AI agent - a system that doesn’t just answer questions but can take actions, call tools, and loop until it solves a problem. RAG is the memory; agents are the hands.
Interview Notes: Advanced RAG Patterns
Basic RAG retrieves chunks by embedding similarity. Production RAG often combines several techniques:
| Pattern | What it adds |
|---|---|
| Hybrid search | Combines dense vectors with keyword/BM25 search. |
| Reranking | Reorders candidates using a stronger cross-encoder or reranker. |
| ColBERT-style retrieval | Late interaction retrieval that keeps token-level matching signals. |
| HyDE | Generates a hypothetical answer/document, then retrieves against it. |
| RAPTOR | Builds hierarchical summaries for multi-hop or broad questions. |
| GraphRAG | Uses entities and relationships when the answer depends on graph structure. |
| Query rewriting | Converts user questions into retrieval-optimized queries. |
Choose the pattern based on failure mode. If retrieval misses exact terms, add hybrid search. If chunks are noisy, add reranking. If answers require relationships, consider GraphRAG.
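As an example of the first pattern, hybrid search needs a way to merge the dense ranking with the keyword ranking. Reciprocal rank fusion (RRF) is one common choice; the sketch below assumes you already have two ranked lists of chunk IDs, and the lists shown are hypothetical. The constant k=60 is a conventional default, not a value from this tutorial.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for one query from two retrievers.
dense_ranking = ["doc-1", "doc-3", "doc-2"]    # vector similarity search
keyword_ranking = ["doc-2", "doc-1", "doc-4"]  # keyword/BM25 search

fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused)
```

RRF uses only rank positions, so it sidesteps the problem that vector distances and BM25 scores live on different scales. Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever (such as an exact-term match) still makes the list.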
Interview Practice
- What problem does RAG solve compared with prompting alone?
- Describe the basic ingest, retrieve, generate pipeline.
- When should you add hybrid search?
- What is reranking, and why does it improve answer quality?
- Compare HyDE, RAPTOR, GraphRAG, and ColBERT-style retrieval.
- What metrics would you use to evaluate retrieval quality?