LLM Mastery for Enterprise AI Engineering / Intermediate Track Module 5 / 8

LLM Mastery for Enterprise AI Engineering Intermediate ⏱ 60 min

DEVQABAPMEXEC

RAG, Memory, and Access Control

Retrieval-augmented generation, vector databases, chunking, memory systems, semantic search, and enterprise RAG security gates.

How to Use This Lesson

Start with the user problem, then map the pattern to architecture and failure modes.
If a code or design example is included, change one assumption and reason through the impact.
Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: Tokens and Tokenization, Embeddings

Free · Subscriber Access

LLM Mastery for Enterprise AI Engineering

Free subscriber access. Enter your email to unlock all 18 modules, track your progress, and export your enterprise AI readiness packet.

Foundation to Advanced — tokens and transformers to deployment readiness and enterprise governance.
12 enterprise deliverables — data cards, eval reports, deployment reviews, governance packets.
Browser-local progress — your completion data stays private, no account needed.

LLM Mastery course page. This lesson is part 5 of 8 in the intermediate track. Use the lab and assessment sections as the completion standard, not optional reading.

Required mastery artifact: by the end of this lesson, update the running enterprise readiness packet for a realistic use case. Treat examples and vendor names as dated illustrations; defend decisions with current model, cost, risk, and evaluation evidence.

Module 06 — RAG & Memory

Teaching models to retrieve information and remember across sessions.

01 — RAG: Retrieval-Augmented Generation

The Core Problem

LLMs have a knowledge cutoff. They don’t know:

What happened last week
Your company’s internal documents
Your proprietary data
Specific domain information not in their training data

Fine-tuning can help, but:

Knowledge becomes stale (models don’t auto-update)
Fine-tuning is expensive
Facts drift and hallucinate over time in fine-tuned models

RAG solves this differently: instead of baking knowledge into the model, inject relevant knowledge at query time.

RAG in One Sentence

Find relevant documents → inject them into the prompt → let the model answer using those documents.

The RAG Pipeline

User Question
     ↓
[Embed the question] — convert question to a vector
     ↓
[Search vector database] — find most relevant document chunks
     ↓
[Retrieve top-K chunks] — e.g., top 5 most relevant passages
     ↓
[Build augmented prompt]:
  "Here is context:
   [CHUNK 1]
   [CHUNK 2]
   [CHUNK 3]
   
   Based on the above context, answer: [USER QUESTION]"
     ↓
[Send to LLM] — model answers using the provided context
     ↓
Response (grounded in real documents)

Why RAG Works So Well

Grounded: Model answers from real documents, not memory
Current: Documents can be updated without retraining
Verifiable: You can show sources
Cost-effective: No expensive fine-tuning for knowledge updates
Controllable: Only use authorized documents

Simple RAG Implementation

import anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

# 1. Initialize
client = anthropic.Anthropic()
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Your knowledge base (in reality, from documents/database)
documents = [
    "GDPR Article 17 establishes the 'right to erasure' (right to be forgotten). Data subjects can request deletion of their personal data when it's no longer necessary, when consent is withdrawn, or when it was unlawfully processed.",
    "PSD2 (Payment Services Directive 2) requires Strong Customer Authentication (SCA) for electronic payment transactions, using at least two of: knowledge (PIN/password), possession (phone/card), or inherence (biometrics).",
    "Basel III requires banks to maintain Common Equity Tier 1 (CET1) ratio of at least 4.5%, Tier 1 capital ratio of 6%, and Total Capital ratio of 8% of risk-weighted assets.",
    "DORA (Digital Operational Resilience Act) requires financial entities in the EU to have robust ICT risk management frameworks, incident reporting procedures, and conduct regular digital operational resilience testing.",
    "MiFID II requires investment firms to record all communications relating to transactions, including phone calls and electronic communications, and retain these records for at least 5 years.",
]

# 3. Create embeddings for all documents (do this once, store in DB)
doc_embeddings = embedder.encode(documents)

def retrieve_relevant_chunks(query: str, top_k: int = 3) -> list[str]:
    """Find most relevant document chunks for a query"""
    query_embedding = embedder.encode(query)
    
    # Calculate cosine similarity
    similarities = np.dot(doc_embeddings, query_embedding) / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    
    # Get top-k most similar
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    
    return [(documents[i], similarities[i]) for i in top_indices]

def rag_answer(question: str) -> str:
    """Answer a question using RAG"""
    
    # Retrieve relevant context
    relevant_chunks = retrieve_relevant_chunks(question, top_k=3)
    
    # Build context
    context = "\n\n".join([
        f"Source {i+1} (relevance: {sim:.2f}):\n{chunk}"
        for i, (chunk, sim) in enumerate(relevant_chunks)
    ])
    
    # Build augmented prompt
    prompt = f"""Here is relevant regulatory information:

{context}

Based ONLY on the provided information above, answer this question:
{question}

If the provided information doesn't contain the answer, say "I don't have specific information about this in the provided documents."
Always cite which source you're drawing from."""

    # Get LLM response
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text

# Test it
questions = [
    "What are the SCA requirements for payments?",
    "What is the minimum CET1 ratio under Basel III?",
    "How long must investment communications be retained?"
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {rag_answer(q)}\n")
    print("-" * 60)

RAG Quality Factors

Factor	Poor	Good
Chunking	Too small (loses context) or too large (drowns signal)	Optimally sized with overlap
Embeddings	Generic embeddings	Domain-specific embeddings
Retrieval	Simple cosine similarity	Hybrid (semantic + keyword)
Context injection	Dump all chunks	Filter, rank, deduplicate
Prompting	No guidance	Clear instructions, cite sources

Enterprise RAG Security Gate

Production RAG must enforce authorization before retrieved text reaches the model. A vector database is not automatically an access-control system.

For every chunk, store:

tenant_id
source document ID and version
owner
data classification
allowed groups or ACL
retention/deletion policy
source approval status
source freshness timestamp

Retrieval must filter by user permissions before prompt construction:

def filter_authorized_chunks(user, chunks):
    return [
        chunk for chunk in chunks
        if chunk["tenant_id"] == user["tenant_id"]
        and chunk["classification"] in user["allowed_classifications"]
        and bool(set(chunk["allowed_groups"]) & set(user["groups"]))
        and chunk["source_status"] == "approved"
    ]
```

Enterprise readiness checklist:

| Control | Required evidence |
|---------|-------------------|
| Document ACLs | Unauthorized users cannot retrieve restricted chunks |
| Tenant isolation | Cross-tenant queries return zero private chunks |
| Source freshness | Stale or withdrawn documents are excluded |
| Deletion | Removed documents are deleted from the index and backups according to policy |
| Prompt-injection defense | Retrieved text is treated as untrusted content |
| Retrieval audit | Query hash, user, chunk IDs, model, and decision are logged |

If a RAG system cannot enforce these controls, it is not ready for enterprise data.

---

# 02 — Vector Databases

## What is a Vector Database?

A regular database stores: name, age, email (exact values).
A vector database stores: embeddings (lists of 1536 numbers) and can find the **most similar** embeddings to a query embedding.

This "similarity search" at scale is what makes RAG work.

---

## How Vector Search Works

```
Your query: "PSD2 authentication requirements"
→ Embedding: [0.23, -0.14, 0.87, ...]

Database has 100,000 document embeddings.
Find: Which embeddings are closest to [0.23, -0.14, 0.87, ...]?

Distance metrics:
- Cosine similarity: angle between vectors (most common)
- Euclidean (L2): direct distance
- Dot product: similar to cosine if normalized

Returns: Top 5 most similar documents (and their similarity scores)

Popular Vector Databases

Database	Type	Best For
Chroma	In-memory/local	Development, small scale
FAISS	Library (not server)	Research, CPU search
Pinecone	Cloud-managed	Production, no ops
Weaviate	Open source server	Production, self-hosted
Qdrant	Open source server	High performance, Rust-based
pgvector	PostgreSQL extension	If you already use PostgreSQL
Milvus	Open source cluster	Very large scale

For most projects: Start with Chroma (development), move to Qdrant or pgvector for production.

Chroma — Getting Started

import chromadb
from sentence_transformers import SentenceTransformer

# Initialize
client = chromadb.Client()  # In-memory
# or: client = chromadb.PersistentClient(path="./chroma_db")

# Create a collection
collection = client.create_collection(
    name="compliance_docs",
    metadata={"hnsw:space": "cosine"}  # Use cosine similarity
)

# Add documents
documents = [
    "GDPR Article 17: Right to erasure...",
    "PSD2 Strong Customer Authentication...",
    "Basel III capital requirements...",
]

embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedder.encode(documents).tolist()

collection.add(
    ids=["doc-001", "doc-002", "doc-003"],
    documents=documents,
    embeddings=embeddings,
    metadatas=[
        {"regulation": "GDPR", "article": "17"},
        {"regulation": "PSD2", "section": "SCA"},
        {"regulation": "Basel III", "category": "capital"},
    ]
)

# Query
results = collection.query(
    query_embeddings=embedder.encode(["authentication requirements"]).tolist(),
    n_results=2,
    include=["documents", "distances", "metadatas"]
)

print(results["documents"])
print(results["distances"])
print(results["metadatas"])

Qdrant — Production-Ready

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Connect
client = QdrantClient(
    url="http://localhost:6333",  # or cloud URL
    api_key="your-api-key"       # for cloud
)

# Create collection
client.create_collection(
    collection_name="compliance_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

# Insert documents
client.upsert(
    collection_name="compliance_docs",
    points=[
        PointStruct(
            id=i,
            vector=embedder.encode(doc).tolist(),
            payload={"text": doc, "regulation": "GDPR", "page": i}
        )
        for i, doc in enumerate(documents)
    ]
)

# Search
results = client.search(
    collection_name="compliance_docs",
    query_vector=embedder.encode("authentication").tolist(),
    limit=5,
    with_payload=True
)

for result in results:
    print(f"Score: {result.score:.3f}")
    print(f"Text: {result.payload['text'][:100]}...")

pgvector — If You’re Already Using PostgreSQL

-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    regulation TEXT,
    embedding vector(384)  -- 384-dim embedding
);

-- Insert with embedding
INSERT INTO documents (content, regulation, embedding)
VALUES ('GDPR Article 17...', 'GDPR', '[0.23, -0.14, ...]');

-- Similarity search
SELECT content, regulation,
       1 - (embedding <=> '[0.25, -0.12, ...]'::vector) AS similarity
FROM documents
ORDER BY embedding <=> '[0.25, -0.12, ...]'::vector
LIMIT 5;
```

```python
# Python with psycopg2 and pgvector
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://user:pass@localhost/compliance_db")
register_vector(conn)

cursor = conn.cursor()
cursor.execute("""
    SELECT content, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY similarity DESC
    LIMIT 5
""", (query_embedding,))

results = cursor.fetchall()

03 — Chunking

The Art of Splitting Documents

Before embedding documents, you need to split them into chunks.

Why not embed the whole document?

Embeddings average meaning across the whole text → specific details get diluted
LLM context window can’t hold a 100-page PDF
A specific answer is buried in a 10-page document

Why not split at every word?

Individual sentences often lack context
”It was amended in 2018.” — what was amended? Need context.

Chunking Strategies

Fixed-size chunking

Split every N characters (or N tokens), with overlap:

def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap  # Overlap for context continuity
    return chunks

# Example
text = "GDPR Article 17 establishes..." * 100  # Long document
chunks = fixed_size_chunk(text, chunk_size=500, overlap=50)
print(f"Created {len(chunks)} chunks")

Recursive character splitting (recommended default)

Split on natural boundaries: paragraphs → sentences → words → characters:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,         # Target chunk size in characters
    chunk_overlap=50,       # Overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""]  # Try these separators in order
)

chunks = splitter.split_text(long_document_text)

Semantic chunking

Split where meaning changes significantly:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # Split when similarity drops below 95th percentile
)

chunks = splitter.split_text(text)
# Chunks may vary greatly in size, but each is semantically coherent

Document-structure-aware splitting

For PDFs with headings, use the structure:

# Split at headers (##, ###, etc.) for markdown documents
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "H1"),
    ("##", "H2"),
    ("###", "H3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
chunks = splitter.split_text(markdown_document)
# Each chunk includes its header hierarchy as metadata

Choosing Chunk Size

Use Case	Chunk Size	Overlap
Dense legal/regulatory text	300-500 chars	50-100
General documents	500-1000 chars	100-200
Code	Whole functions (variable)	0-50
Conversational	200-300 chars	50

The golden rule: Chunk size should match the granularity of questions you expect.

If users ask about specific articles/clauses → smaller chunks. If users ask for broad summaries → larger chunks.

04 — Retrieval Pipelines

Beyond Simple Embedding Search

Basic RAG: embed query → find nearest documents → inject into prompt

Advanced RAG: multiple stages, multiple strategies, smart filtering.

Hybrid Retrieval (Semantic + Keyword)

Sometimes keyword matching beats semantic search:

“What does DORA article 5 paragraph 3 say?” → keyword search wins (exact article reference)
“What regulations apply to payment authentication?” → semantic search wins (conceptual query)

Hybrid search combines both:

from qdrant_client.models import SparseVector, NamedSparseVector

# Qdrant supports hybrid search with sparse + dense vectors
# BM25 (keyword) + Dense (semantic) combined with RRF (Reciprocal Rank Fusion)

# Most production RAG systems use hybrid retrieval

Re-ranking

Retrieve more candidates, then re-rank with a more powerful model:

from sentence_transformers import CrossEncoder

# Bi-encoder: fast, used for initial retrieval
retriever = SentenceTransformer('all-MiniLM-L6-v2')

# Cross-encoder: slow but accurate, used for re-ranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query: str, top_k: int = 3):
    # Step 1: Fast retrieval — get top 20 candidates
    candidates = vector_db_search(query, top_k=20)
    
    # Step 2: Re-rank with cross-encoder (compares query+document together)
    scores = reranker.predict([(query, doc) for doc in candidates])
    
    # Step 3: Return top-k after re-ranking
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

Query Expansion & Transformation

Sometimes the user’s question is poorly phrased. Transform it first:

def expand_query(original_query: str, client) -> list[str]:
    """Generate multiple versions of the query for better retrieval"""
    
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Generate 3 different versions of this question, each phrased differently:
            
Original: {original_query}

Output ONLY the 3 questions, one per line, no numbering."""
        }]
    )
    
    variants = response.content[0].text.strip().split('\n')
    return [original_query] + variants  # Include original + variants

# Then retrieve for all variants and merge results
def multi_query_retrieve(query: str, top_k: int = 5):
    query_variants = expand_query(query)
    all_results = []
    
    for variant in query_variants:
        results = vector_search(variant, top_k=top_k)
        all_results.extend(results)
    
    # Deduplicate by document ID, keeping highest similarity
    seen = {}
    for result in all_results:
        doc_id = result.id
        if doc_id not in seen or result.score > seen[doc_id].score:
            seen[doc_id] = result
    
    return sorted(seen.values(), key=lambda x: x.score, reverse=True)[:top_k]

RAG Evaluation Metrics

Metric	What It Measures
Recall@K	Did the relevant document appear in top K results?
MRR (Mean Reciprocal Rank)	How highly ranked is the first relevant result?
Answer correctness	Is the final answer right?
Faithfulness	Does the answer stay faithful to the retrieved context?
Context precision	How much of retrieved context was actually useful?
Context recall	Did we retrieve all the relevant information?

# Using RAGAS library for RAG evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

results = evaluate(
    dataset=eval_dataset,  # Questions + retrieved context + generated answers + ground truth
    metrics=[faithfulness, answer_relevancy, context_recall]
)
print(results)

05 — AI Memory Systems

The Problem: LLMs Forget

Every LLM conversation starts fresh. The model has no memory of previous sessions.

For personal assistants, customer support bots, and ongoing workflows, this is a major limitation.

Types of Memory

1. Conversation Buffer (Short-term)

Keep the full conversation history in context:

messages = [
    {"role": "user", "content": "My name is Praveen"},
    {"role": "assistant", "content": "Nice to meet you, Praveen!"},
    {"role": "user", "content": "What's my name?"},
]
# Works within one session, but context grows unbounded

2. Summary Memory

Summarize old conversations to save tokens:

# After every N turns, summarize old turns:
summary = "User mentioned their name is Praveen and they work at Fiserv..."
messages = [
    {"role": "system", "content": f"Conversation summary: {summary}"},
    # Only keep last 5 turns in full
]

3. Entity Memory

Extract and store specific facts about entities:

memory_store = {
    "Praveen": {
        "employer": "Fiserv",
        "role": "Senior Application Analyst",
        "location": "Germany",
        "interests": ["AI", "compliance automation"]
    }
}
# Before each response, inject relevant entities

4. Episodic Memory (Long-term, Vector-based)

Store important conversation moments as embeddings, retrieve relevant ones:

# Store memorable conversation excerpts
memory_db.add("Praveen mentioned he's preparing for FDE role at Anthropic")

# Before each new conversation, search for relevant memories
relevant_memories = memory_db.search(current_topic, top_k=5)
system_prompt += f"\nRelevant memories:\n{relevant_memories}"

Practical Memory Architecture

class ConversationMemory:
    def __init__(self):
        self.short_term = []        # Recent messages (last 10)
        self.summary = ""           # Summary of older messages
        self.entity_store = {}      # Known facts about entities
        self.episodic_db = VectorDB()  # Searchable long-term memories
    
    def add_turn(self, role: str, content: str):
        self.short_term.append({"role": role, "content": content})
        
        # If context getting long, summarize old turns
        if len(self.short_term) > 20:
            self._compress_memory()
        
        # Extract entities
        self._extract_entities(content)
        
        # Store as episodic memory
        self.episodic_db.add(content)
    
    def _compress_memory(self):
        """Summarize older messages to save tokens"""
        old_turns = self.short_term[:10]
        self.short_term = self.short_term[10:]
        
        # Use LLM to summarize
        summary = summarize(old_turns)
        self.summary += f"\n{summary}"
    
    def get_context(self, current_query: str) -> list:
        """Build context for a new response"""
        context = []
        
        # Include summary of old conversation
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Earlier conversation summary:\n{self.summary}"
            })
        
        # Include relevant episodic memories
        memories = self.episodic_db.search(current_query, top_k=3)
        if memories:
            context.append({
                "role": "system",
                "content": f"Relevant memories:\n{memories}"
            })
        
        # Include recent messages
        context.extend(self.short_term)
        
        return context

Memory Libraries

# mem0 — managed AI memory
from mem0 import Memory

m = Memory()
m.add("Praveen works at Fiserv and is building a compliance automation system", user_id="praveen")

# Later:
memories = m.search("compliance project", user_id="praveen")
# Returns: [{"memory": "Working on compliance automation at Fiserv..."}]

# Zep — production memory for AI applications
from zep_cloud.client import Zep
client = Zep(api_key="...")
# Handles memory automatically per session

06 — Semantic Search

Beyond Keyword Search

Traditional search: matches exact words. Semantic search: matches meaning.

Query: "rules about deleting customer data"

Keyword search finds:
→ Documents containing "rules", "deleting", "customer", "data"

Semantic search finds:
→ "GDPR Article 17 right to erasure" ← correct, even though no word overlap!
→ "data retention policies"
→ "customer data deletion procedures"

Implementing Semantic Search

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticSearch:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None
    
    def index(self, documents: list[str]):
        """Index documents for search"""
        self.documents = documents
        self.embeddings = self.model.encode(documents, 
                                            show_progress_bar=True,
                                            batch_size=32)
        print(f"Indexed {len(documents)} documents")
    
    def search(self, query: str, top_k: int = 5) -> list[tuple]:
        """Search for most relevant documents"""
        query_embedding = self.model.encode(query)
        
        similarities = np.dot(self.embeddings, query_embedding) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_embedding)
        )
        
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        return [(self.documents[i], float(similarities[i])) for i in top_indices]

# Usage
search = SemanticSearch()
search.index(compliance_documents)

results = search.search("how to handle customer data deletion requests")
for doc, score in results:
    print(f"Score: {score:.3f} | {doc[:100]}...")

Embedding Models for Semantic Search

Model	Dimensions	Speed	Quality	Use Case
all-MiniLM-L6-v2	384	Very Fast	Good	General, development
all-mpnet-base-v2	768	Fast	Very Good	Production general
bge-large-en-v1.5	1024	Slow	Excellent	Production quality
text-embedding-3-small	1536	API	Very Good	OpenAI, production
text-embedding-3-large	3072	API	Excellent	OpenAI, high quality
e5-mistral-7b	4096	Slow	Best	Top quality, slow

For production RAG with compliance data: bge-large-en-v1.5 or text-embedding-3-small.

📝 Module 06 Summary

Concept	Key Takeaway
RAG	Find relevant docs → inject into prompt → ground answers in reality
Vector DB	Stores embeddings, finds similar documents by meaning (not keywords)
Chunking	Split documents into optimally-sized pieces before embedding
Hybrid retrieval	Combine semantic + keyword search for better coverage
Re-ranking	First retrieve broadly, then re-rank with powerful cross-encoder
Memory	Short-term (buffer), medium-term (summary), long-term (episodic)
Semantic search	Find documents by meaning, not exact word matches

🧠 Mental Model

RAG is like having a smart research assistant. When you ask a question:

They search the library (vector DB) for relevant books/articles

They bring you the most relevant passages (retrieval)

They help you find the answer within those passages (LLM generation)

Without RAG, the LLM is a scholar answering from memory — great for general knowledge, risky for specifics.

🏋️ Module Exercise

Build a compliance RAG system with Chroma + Claude:

# pip install chromadb sentence-transformers anthropic

import chromadb
from sentence_transformers import SentenceTransformer
import anthropic
import json

# Setup
chroma_client = chromadb.PersistentClient(path="./compliance_db")
collection = chroma_client.get_or_create_collection("regulations")
embedder = SentenceTransformer('all-MiniLM-L6-v2')
ai_client = anthropic.Anthropic()

# Documents to index
regulations = [
    {"id": "gdpr-17", "text": "GDPR Article 17 (Right to Erasure): Data subjects have the right to request deletion of personal data when: it's no longer necessary for the purpose collected; consent is withdrawn; data was unlawfully processed; or erasure is required by law.", "regulation": "GDPR"},
    {"id": "psd2-sca", "text": "PSD2 Strong Customer Authentication requires at least 2 of 3 factors: Knowledge (something only the user knows — PIN, password), Possession (something only the user has — card, phone), Inherence (something the user is — fingerprint, face).", "regulation": "PSD2"},
    {"id": "basel3-capital", "text": "Basel III Capital Requirements: Minimum CET1 ratio 4.5%; Tier 1 capital ratio 6%; Total Capital ratio 8%. Conservation buffer of 2.5% CET1. Countercyclical buffer 0-2.5%. Total minimum with buffers: 10.5% CET1.", "regulation": "Basel III"},
    {"id": "mifid2-records", "text": "MiFID II Article 16(7): Investment firms must keep records of all services, activities, and transactions. Communications relating to transactions must be recorded and retained for 5 years (regulators can extend to 7 years). Includes phone calls and electronic communications.", "regulation": "MiFID II"},
    {"id": "dora-ict", "text": "DORA (Digital Operational Resilience Act): Financial entities must establish comprehensive ICT risk management framework, implement incident classification and reporting procedures, conduct annual TLPT (Threat-Led Penetration Testing), and manage third-party ICT risks.", "regulation": "DORA"},
]

# Index documents
texts = [r["text"] for r in regulations]
embeddings = embedder.encode(texts).tolist()

collection.upsert(
    ids=[r["id"] for r in regulations],
    documents=texts,
    embeddings=embeddings,
    metadatas=[{"regulation": r["regulation"]} for r in regulations]
)

print(f"Indexed {len(regulations)} regulatory documents")

def compliance_rag(question: str) -> dict:
    """Answer a compliance question using RAG"""
    
    # 1. Embed the question
    query_embedding = embedder.encode(question).tolist()
    
    # 2. Retrieve relevant documents
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3,
        include=["documents", "distances", "metadatas"]
    )
    
    # 3. Build context
    retrieved_docs = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    
    context_pieces = []
    for doc, meta, dist in zip(retrieved_docs, metadatas, distances):
        similarity = 1 - dist  # Chroma uses L2 distance, convert to similarity
        context_pieces.append(f"[{meta['regulation']}] {doc}")
    
    context = "\n\n".join(context_pieces)
    
    # 4. Generate answer
    response = ai_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""You are a compliance expert. Use ONLY the provided regulatory information to answer.

REGULATORY CONTEXT:
{context}

QUESTION: {question}

Instructions:
- Answer based strictly on the provided context
- Cite the specific regulation (GDPR, PSD2, etc.)
- If information is incomplete, say so
- Keep answer concise but complete"""
        }]
    )
    
    return {
        "question": question,
        "answer": response.content[0].text,
        "sources": [meta["regulation"] for meta in metadatas],
        "retrieved_chunks": retrieved_docs
    }

# Test the system
test_questions = [
    "What authentication factors are required for EU payments?",
    "How long must investment firms keep transaction records?",
    "What is the minimum CET1 capital ratio?",
    "What is the right to erasure under GDPR?"
]

for question in test_questions:
    result = compliance_rag(question)
    print(f"\nQ: {result['question']}")
    print(f"A: {result['answer']}")
    print(f"Sources: {', '.join(result['sources'])}")
    print("-" * 60)
```

**Challenge:** Add a UI with Gradio or Streamlit. Add 20+ real regulatory documents. Evaluate answer quality.

### Required Enterprise Extensions

Add these before submitting the lab:

1. **ACL metadata:** add `tenant_id`, `classification`, `allowed_groups`, and `source_status` to each indexed document.
2. **Permission filter:** block unauthorized chunks before building the prompt.
3. **Retrieval metrics:** report top-k source IDs, similarity scores, and whether the expected source was retrieved.
4. **Citation scoring:** check whether the answer cites a retrieved approved source.
5. **Prompt-injection test:** include at least one malicious document that says to ignore instructions, and prove the answer does not follow it.
6. **Deletion test:** remove one source document, rebuild or update the index, and prove it is no longer retrieved.

### Lab Submission

Submit:

- `rag_app.py` or notebook with the working RAG flow.
- `rag_eval_cases.jsonl` with at least 10 questions and expected source IDs.
- `rag_eval_results.json` with retrieval hit rate, citation pass rate, and failed cases.
- `access-control-test.md` showing one allowed query and one blocked query.
- `prompt-injection-test.md` showing the malicious document test and outcome.
- `README.md` with setup, assumptions, and known limitations.

### Pass/Fail Standard

| Requirement | Pass standard |
|-------------|---------------|
| Retrieval | Expected source appears in top 3 for at least 80% of eval cases |
| Citations | At least 90% of answers cite an approved retrieved source |
| Access control | Unauthorized user cannot retrieve restricted chunks |
| Tenant isolation | Cross-tenant query returns zero private chunks |
| Prompt injection | Malicious retrieved text cannot override system instructions |
| Deletion | Removed source no longer appears in retrieval results |

---

*Move to [Module 07 — Agents & Workflows](/tutorials/llm-mastery/intermediate/06-agents-workflows-tool-safety)*