Build Your First RAG System

Why RAG Exists

Your company has thousands of internal documents. Your LLM knows nothing about them - it was trained on public internet data, not your product specs, support tickets, or policy docs.

You have two options:

Fine-tune the model on your data. Expensive, slow, requires ML expertise, and goes stale the moment a document changes.
RAG - Retrieval-Augmented Generation. Store your documents in a vector database, retrieve the most relevant ones at query time, and inject them into the prompt. Fast, cheap, and as current as your index: when a document changes, you re-index it and the next query sees the update.

RAG is behind many enterprise AI applications. It turns a general-purpose LLM into one that can answer from your business's own documents.

The Full RAG Pipeline

There are two phases: indexing (done once, or incrementally) and querying (done at runtime for every user request).

RAG Pipeline: Indexing and Query Phases

flowchart TD
  D[Your Documents] --> C["Chunker<br/>Split into segments"]
  C --> E["Embedding Model<br/>Text → vector"]
  E --> VS[("Vector Store<br/>ChromaDB")]

  Q[User Query] --> QE[Embed Query]
  QE --> RET[Similarity Search]
  VS --> RET
  RET --> CTX[Context Assembly]
  CTX --> LLM["LLM<br/>Generate answer"]
  LLM --> ANS[Answer + Sources]

  style D fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
  style VS fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
  style LLM fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
  style ANS fill:#dcfce7,stroke:#16a34a,color:#15803d
  style Q fill:#dbeafe,stroke:#2563eb,color:#1d4ed8


Indexing phase (happens once or on document updates):

Chunk - Split documents into segments (typically 256-512 tokens each)
Embed - Convert each chunk to a vector using an embedding model
Store - Save vectors + original text in a vector database

Query phase (happens on every user request):

Embed the query - Use the same embedding model to convert the question to a vector
Similarity search - Find the top-K chunks whose vectors are closest to the query vector
Assemble context - Build a prompt that includes the retrieved chunks
Generate - Send the augmented prompt to the LLM and return the answer
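
The "assemble context" step can be sketched in a few lines. This is a minimal illustration under assumptions, not a fixed recipe: `build_prompt` and the `[chunk-id]` tagging format are hypothetical names chosen for this example, and the chunk text is made up.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble an augmented prompt from (chunk_id, chunk_text) pairs."""
    # Tag each chunk with its ID so the answer can point back to a source.
    context = "\n\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks)
    return (
        "Use only the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long do I have to return something?",
    [("doc-1", "Returns are accepted within 30 days of purchase with a receipt.")],
)
print(prompt)
```

Keeping the chunk ID next to the chunk text is one simple way to make it possible to return sources alongside the answer.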

What Is a Vector and Why Does It Work?

An embedding model converts text into a list of numbers (a vector) - 1,536 numbers by default for OpenAI’s text-embedding-3-small. Text with similar meaning produces similar vectors. “dog” and “puppy” end up close together in vector space. “dog” and “quarterly revenue” end up far apart.

Similarity search finds the chunks whose vectors are closest to the query vector, measured by cosine similarity. This is why RAG works even when the user’s question uses different words than the document - the concepts align even when the vocabulary doesn’t.
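
To make cosine similarity concrete, here is a small sketch in plain Python. The three-dimensional vectors are invented for illustration only; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, not real embeddings.
dog = [0.9, 0.8, 0.1]
puppy = [0.85, 0.75, 0.2]
quarterly_revenue = [0.1, 0.05, 0.95]

print(f"dog vs puppy:             {cosine_similarity(dog, puppy):.3f}")
print(f"dog vs quarterly revenue: {cosine_similarity(dog, quarterly_revenue):.3f}")
```

The first score comes out close to 1 and the second much lower, which is the "close together" versus "far apart" behavior described above.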

Why Not Just Use Full-Text Search?

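Full-text search matches words, not meaning. A plain keyword query for “dog” won’t find a document that says only “canine” unless you maintain synonym lists. Vector similarity search is semantic - it matches meaning, not keywords. For question-answering over documents, where users rarely phrase a question the way the document phrases the answer, semantic search often surfaces relevant chunks that keyword search misses. The reverse also holds: for exact terms such as product codes or error strings, keyword search is usually more reliable.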
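
The keyword limitation is easy to reproduce. The sketch below uses a deliberately naive matcher (a hypothetical `keyword_match` helper, not how a real search engine works, since those add stemming, synonyms, and ranking) to show a synonym slipping through.

```python
import re


def keyword_match(query: str, document: str) -> bool:
    """Return True if the query and document share at least one word."""
    query_words = set(re.findall(r"\w+", query.lower()))
    doc_words = set(re.findall(r"\w+", document.lower()))
    return bool(query_words & doc_words)


print(keyword_match("dog", "The dog park opens at 8."))    # shared word: "dog"
print(keyword_match("dog", "Our canine training guide."))  # no shared word
```

The second document is about the same concept but shares no word with the query, so a word-overlap matcher cannot find it.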

Chunking: The Most Overlooked Step

Chunk size is the single most impactful parameter in your RAG system. Too small, and each chunk lacks context - the LLM gets fragments. Too large, and you hit context window limits and dilute relevance.

Practical starting point: 512 tokens per chunk with a 50-token overlap between consecutive chunks. The overlap prevents a sentence from being split in a way that loses its meaning at the boundary.

[chunk 1: tokens 0-511]
[chunk 2: tokens 462-973]   ← 50-token overlap with chunk 1
[chunk 3: tokens 924-1435]  ← 50-token overlap with chunk 2

The overlap means that a sentence landing at the edge of a chunk still appears in full in one of the two surrounding chunks, provided the sentence is shorter than the overlap.
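
The chunk boundaries shown above follow from a fixed step of `chunk_size - overlap` tokens. A minimal sketch, assuming the document is already tokenized and only the index ranges are needed (`chunk_ranges` is a hypothetical helper name):

```python
def chunk_ranges(
    total_tokens: int, chunk_size: int = 512, overlap: int = 50
) -> list[tuple[int, int]]:
    """Return inclusive (start, end) token index ranges for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts this many tokens after the last
    ranges = []
    start = 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens) - 1
        ranges.append((start, end))
        if end == total_tokens - 1:
            break  # the last token is covered, so stop
        start += step
    return ranges


print(chunk_ranges(1436))
# [(0, 511), (462, 973), (924, 1435)]
```

With a 1,436-token document this reproduces the three ranges in the diagram. To get the chunk text, slice the token list with each range and decode it using your tokenizer.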

Build It: ChromaDB + OpenAI RAG

This example embeds three documents about different topics, then queries the collection to demonstrate retrieval and generation.

Minimal RAG Pipeline with ChromaDB and OpenAI

Example code (static). Copy and run locally in your own environment.

import os
import chromadb
from openai import OpenAI

# Initialize clients
oai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
chroma = chromadb.Client()
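# Use cosine distance; ChromaDB's default metric is squared L2, which has a different scale
collection = chroma.create_collection("docs", metadata={"hnsw:space": "cosine"})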

# --- INDEXING PHASE ---
documents = [
  "The company refund policy allows returns within 30 days of purchase with a receipt.",
  "Our data retention policy requires all customer records to be deleted after 7 years.",
  "The on-call rotation for the infrastructure team runs Monday to Sunday, 2-week cycles.",
]
doc_ids = ["doc-1", "doc-2", "doc-3"]

def embed(texts: list[str]) -> list[list[float]]:
  resp = oai.embeddings.create(
      model="text-embedding-3-small",
      input=texts
  )
  return [r.embedding for r in resp.data]

# Embed and store all documents
vectors = embed(documents)
collection.add(
  ids=doc_ids,
  embeddings=vectors,
  documents=documents,
)
print(f"Indexed {len(documents)} documents.")

# --- QUERY PHASE ---
query = "How long do I have to return something I bought?"

# Embed the query using the same model
query_vector = embed([query])[0]

# Retrieve top-2 most similar chunks
results = collection.query(
  query_embeddings=[query_vector],
  n_results=2,
)

retrieved_chunks = results["documents"][0]
context = "\n\n".join(retrieved_chunks)
print(f"\nRetrieved chunks:\n{context}")

# Augment the prompt with retrieved context
augmented_prompt = f"""Use only the context below to answer the question.

Context:
{context}

Question: {query}
Answer:"""

# Generate the answer
response = oai.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
      {"role": "system", "content": "You answer questions using only the provided context."},
      {"role": "user", "content": augmented_prompt},
  ],
  max_tokens=200,
  temperature=0,
)

print(f"\nAnswer: {response.choices[0].message.content}")

Run this and you’ll see: the query about returns ranks the refund policy chunk first. Because n_results=2, a second, less relevant chunk is returned as well, but the LLM answers from the refund policy text because that is the only context that addresses the question.

Understanding Similarity Scores

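ChromaDB returns a distance alongside each retrieved chunk (lower = more similar). Use it to filter out low-quality retrievals. The scale depends on the collection’s distance metric: ChromaDB defaults to squared L2, so the collection must be configured for cosine distance for the threshold here to apply. If the closest chunk has a cosine distance above ~0.5, you’re probably retrieving noise - consider returning “I don’t have information about this” rather than letting the LLM answer from irrelevant context. Treat 0.5 as a starting point and tune it on your own data.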

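distances = results["distances"][0]
MAX_DISTANCE = 0.5  # Distance threshold: lower distance = more similar

relevant_chunks = []
for doc, dist in zip(retrieved_chunks, distances):
    if dist < MAX_DISTANCE:
        # Good retrieval - use it
        relevant_chunks.append(doc)
    else:
        # Poor retrieval - discard or warn
        print(f"Discarding chunk (distance {dist:.2f}): {doc[:60]}")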
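
One way to act on the distance check is to skip generation entirely when nothing passes the threshold. The sketch below is self-contained and makes assumptions: `answer_or_fallback` is a hypothetical helper, and `generate` stands in for whatever function calls your LLM.

```python
from typing import Callable

FALLBACK = "I don't have information about this."


def answer_or_fallback(
    chunks: list[str],
    distances: list[float],
    generate: Callable[[str], str],
    max_distance: float = 0.5,
) -> str:
    """Generate from chunks under the distance threshold, else return a fallback."""
    kept = [chunk for chunk, dist in zip(chunks, distances) if dist < max_distance]
    if not kept:
        # Nothing relevant was retrieved: do not call the LLM at all.
        return FALLBACK
    return generate("\n\n".join(kept))


# Stub generator for demonstration; a real one would call the chat API.
def echo_context(context: str) -> str:
    return f"(answer based on: {context})"


print(answer_or_fallback(["Refund policy text"], [0.21], echo_context))
print(answer_or_fallback(["On-call rotation text"], [0.83], echo_context))
```

Returning early also saves an LLM call for questions the index cannot answer.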

⚙️ For Developers

Chunk size is your most important parameter. Start with 512 tokens and 50-token overlap. If your retrieval recall is poor (right answers aren’t showing up in top-K), try smaller chunks (256 tokens). If context is too fragmented, try larger chunks (768 tokens). Measure with a retrieval eval before moving to end-to-end evals - you can’t fix a generation problem that’s actually a retrieval problem.

🧪 For QA Engineers

Test retrieval separately from generation. Phase 1: can your system retrieve the right chunks given a known question? Build a retrieval eval with 20-30 query/expected-chunk pairs and measure recall@3 (does the right chunk appear in the top 3?). Phase 2: only after retrieval recall is above 80% should you build end-to-end generation evals. Mixing the two makes failures hard to diagnose, because a wrong answer could come from either stage.
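
A retrieval eval of this kind needs very little code. This is a sketch under assumptions: `recall_at_k` is a hypothetical helper, `retrieve` is any function that returns ranked chunk IDs for a query, and the canned results below are invented to stand in for a real vector store.

```python
from typing import Callable


def recall_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k results."""
    hits = sum(
        1 for query, expected_id in eval_set if expected_id in retrieve(query)[:k]
    )
    return hits / len(eval_set)


# Query / expected-chunk pairs (a real eval would have 20-30 of these).
eval_set = [
    ("refund window", "doc-1"),
    ("data retention", "doc-2"),
    ("on-call schedule", "doc-3"),
]

# Invented ranked results standing in for a real retriever.
canned_results = {
    "refund window": ["doc-1", "doc-3", "doc-2"],
    "data retention": ["doc-3", "doc-1", "doc-2"],
    "on-call schedule": ["doc-1", "doc-2", "doc-4"],
}

print(f"recall@3 = {recall_at_k(eval_set, canned_results.__getitem__, k=3):.2f}")
print(f"recall@1 = {recall_at_k(eval_set, canned_results.__getitem__, k=1):.2f}")
```

Swapping `canned_results.__getitem__` for a function that queries your collection turns this into a regression check you can rerun after every chunking or embedding change.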

What You’ve Built vs. What Production Needs

This example uses an in-memory ChromaDB collection (data is lost on restart). A production RAG system adds:

Persistent vector store - ChromaDB with disk persistence, Pinecone, Weaviate, or pgvector
Document ingestion pipeline - Watch for new/updated documents and re-index incrementally
Metadata filtering - Filter by department, date, or access level before semantic search
Re-ranking - A second-pass model (cross-encoder) to re-sort the top-K results by relevance
Citation tracking - Return source document IDs alongside the answer so users can verify
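
For the ingestion pipeline, one common way to re-index incrementally is to store a content hash per document and re-embed only what changed. A minimal sketch, assuming documents are held as an ID-to-text mapping (`plan_reindex` and the sample data are hypothetical):

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str], indexed_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current documents with stored hashes.

    Returns (ids_to_upsert, ids_to_delete).
    """
    to_upsert = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_upsert, to_delete


# Hashes recorded at the last indexing run (sample data).
indexed = {
    "doc-1": content_hash("Returns within 30 days."),
    "doc-2": content_hash("Records deleted after 7 years."),
}
# Current state: doc-1 edited, doc-2 removed, doc-3 added.
current = {
    "doc-1": "Returns within 60 days.",
    "doc-3": "On-call runs in 2-week cycles.",
}

print(plan_reindex(current, indexed))
```

Only the IDs in the first list need to be re-embedded, which keeps embedding costs proportional to what changed rather than to the size of the corpus.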

Production Gotcha

Treat the embedding model and the generation model as separate components with separate version tracking. Use a dedicated embedding model (text-embedding-3-small) and a separate generation model (gpt-4o-mini). You can swap the generation model at any time without touching the index. The embedding model is different: vectors from one embedding model are not comparable with vectors from another, so if you upgrade it, every stored vector becomes incompatible with new query embeddings and retrieval breaks silently. Record which embedding model built each index, and re-embed the whole corpus whenever that model changes.
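
One way to make an embedding-model mismatch fail loudly instead of silently is to store the model name and vector size with the index and check them before every query. The names below (`IndexInfo`, `compatibility_error`) are hypothetical, and this is a sketch rather than a feature of any particular vector store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexInfo:
    """Metadata recorded when the index was built."""
    embedding_model: str
    dimensions: int


def compatibility_error(
    index: IndexInfo, query_model: str, query_vector: list[float]
) -> Optional[str]:
    """Return a description of the mismatch, or None if the query is compatible."""
    if query_model != index.embedding_model:
        return (
            f"index was built with {index.embedding_model!r} "
            f"but the query was embedded with {query_model!r}"
        )
    if len(query_vector) != index.dimensions:
        return (
            f"index expects {index.dimensions} dimensions, "
            f"got {len(query_vector)}"
        )
    return None


index = IndexInfo(embedding_model="text-embedding-3-small", dimensions=1536)

print(compatibility_error(index, "text-embedding-3-small", [0.0] * 1536))
print(compatibility_error(index, "some-newer-model", [0.0] * 1536))
```

The model-name check matters even when dimensions match, because two models can produce vectors of the same length that live in unrelated spaces.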

What’s Next

In the next tutorial you’ll build an AI agent - a system that doesn’t just answer questions but can take actions, call tools, and loop until it solves a problem. RAG is the memory; agents are the hands.

Interview Notes: Advanced RAG Patterns

Basic RAG retrieves chunks by embedding similarity. Production RAG often combines several techniques:

Pattern	What it adds
Hybrid search	Combines dense vectors with keyword/BM25 search.
Reranking	Reorders candidates using a stronger cross-encoder or reranker.
ColBERT-style retrieval	Late interaction retrieval that keeps token-level matching signals.
HyDE	Generates a hypothetical answer/document, then retrieves against it.
RAPTOR	Builds hierarchical summaries for multi-hop or broad questions.
GraphRAG	Uses entities and relationships when the answer depends on graph structure.
Query rewriting	Converts user questions into retrieval-optimized queries.

Choose the pattern based on failure mode. If retrieval misses exact terms, add hybrid search. If chunks are noisy, add reranking. If answers require relationships, consider GraphRAG.
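
As an illustration of hybrid search, the sketch below merges a dense (vector) ranking and a keyword ranking using reciprocal rank fusion (RRF). RRF is one common fusion method, chosen here as an assumption rather than something this lesson prescribes; the rankings are invented, and `k = 60` is a conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Invented rankings from two retrievers for the same query.
dense_ranking = ["doc-2", "doc-1", "doc-3"]    # vector similarity
keyword_ranking = ["doc-1", "doc-4", "doc-2"]  # keyword/BM25

print(reciprocal_rank_fusion([dense_ranking, keyword_ranking]))
# ['doc-1', 'doc-2', 'doc-4', 'doc-3']
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever (such as an exact keyword hit) still makes it into the merged list.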

Interview Practice

What problem does RAG solve compared with prompting alone?
Describe the basic ingest, retrieve, generate pipeline.
When should you add hybrid search?
What is reranking, and why does it improve answer quality?
Compare HyDE, RAPTOR, GraphRAG, and ColBERT-style retrieval.
What metrics would you use to evaluate retrieval quality?
