LLM Mastery for Enterprise AI Engineering Intermediate ⏱ 55 min

Fine-Tuning with LoRA, QLoRA, DPO, and RLHF

How to customize models responsibly and prove the tuned model is better than the baseline.

How to Use This Lesson

Start with the user problem, then map the pattern to architecture and failure modes.
If a code or design example is included, change one assumption and reason through the impact.
Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: Datasets, Training, and Data Governance

LLM Mastery course page. This lesson is part 2 of 8 in the intermediate track. Use the lab and assessment sections as the completion standard, not optional reading.

Required mastery artifact: by the end of this lesson, update the running enterprise readiness packet for a realistic use case. Treat examples and vendor names as dated illustrations; defend decisions with current model, cost, risk, and evaluation evidence.

Module 03 — Fine-Tuning

The real engineering: making a model yours. LoRA, QLoRA, DPO, RLHF, Quantization, Checkpoints, Adapters, GGUF.

01 — LoRA: Low-Rank Adaptation

The Problem LoRA Solves

Full fine-tuning means updating ALL parameters of a model.

For LLaMA 3 8B:

8 billion parameters
Each stored as fp16 (2 bytes)
Plus gradients (same size)
Plus optimizer states (2x parameters for Adam)
= ~80+ GB VRAM just to fine-tune

That is more than a single A100 80GB can hold, so you need at least two of them (more once activations are counted). For a single engineer, prohibitive.
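The arithmetic behind that estimate can be sketched in a few lines. This is a back-of-the-envelope helper, not a profiler: the function name and the per-component byte counts are illustrative assumptions, and activations and framework overhead are ignored.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Rough memory estimate (GB) for weights + gradients + Adam state.

    Assumption: Adam keeps two values per parameter. Stored in fp32 that is
    8 bytes per parameter; stored in 16-bit it is 4 bytes per parameter.
    Activations and framework overhead are not included.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * bytes_per_param / 1e9


llama3_8b = 8_000_000_000
fp32_adam = full_finetune_memory_gb(llama3_8b)                     # Adam state in fp32
fp16_adam = full_finetune_memory_gb(llama3_8b, optimizer_bytes=4)  # Adam state in 16-bit
print(f"fp32 optimizer state:   ~{fp32_adam:.0f} GB")
print(f"16-bit optimizer state: ~{fp16_adam:.0f} GB")
```

Under these assumptions the total lands somewhere between 64 and 96 GB before activations, which brackets the rough figure quoted above.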

LoRA says: you don’t need to update all 8 billion parameters. On many tasks you can recover most of the benefit of full fine-tuning by training a small set of added parameters while the original weights stay frozen.

How LoRA Works

Here’s the key insight:

When we fine-tune a model, the change to the weight matrices tends to be approximately low-rank. This means the change can be approximated well by the product of two small matrices.

The math (don’t panic):

Original weight matrix W: (4096 × 4096) ≈ 16.8 million numbers

Instead of updating W directly, LoRA freezes W and trains two small matrices:

B: (4096 × 8) = 32,768 numbers
A: (8 × 4096) = 32,768 numbers

Then the effective update is: W_new = W + B × A, where B × A has the same (4096 × 4096) shape as W. In practice the update is also scaled by lora_alpha / r.

The rank (r=8 here) is a hyperparameter. Common values: 4, 8, 16, 32, 64.

Original: Update 16,777,216 parameters
LoRA r=8: Update 65,536 parameters
Reduction: 256x fewer parameters to train!
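To make the shapes concrete, here is a minimal pure-Python sketch of the low-rank update. It assumes the usual LoRA convention of B with shape (d × r) and A with shape (r × d), with B initialized to zero; the helper names and the tiny demo sizes are illustrative only.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_param_count(d_out, d_in, r):
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in


# Parameter counts for the 4096 x 4096 example with r=8
full_params = 4096 * 4096
lora_params = lora_param_count(4096, 4096, 8)
print(full_params, lora_params, full_params // lora_params)

# Tiny numerical demo (d=6, r=2) so it runs instantly
random.seed(0)
d, r = 6, 2
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                  # d x r, zero init

delta = matmul(B, A)                                               # d x d update
W_new = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

print(len(delta), len(delta[0]))  # the update has the same shape as W
print(W_new == W)                 # zero-initialized B means no change before training
```

Because B starts at zero, the adapted model is identical to the base model at step 0, and training only has to learn the difference.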

LoRA in Practice

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                    # Rank — higher = more capacity but more params
    lora_alpha=32,           # Scaling factor (usually 2x rank)
    target_modules=[         # Which layers to apply LoRA to
        "q_proj",            # Query projection in attention
        "k_proj",            # Key projection
        "v_proj",            # Value projection
        "o_proj",            # Output projection
        "gate_proj",         # Feed-forward layers
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,       # Dropout for regularization
    bias="none",             # Don't train biases
    task_type="CAUSAL_LM"    # Task type
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# See how many parameters we're actually training
model.print_trainable_parameters()
# Approximate output for r=16 on these seven modules:
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196

# Only about 0.5% of parameters! That's the power of LoRA

Choosing LoRA Rank (r)

| Rank | Use Case |
|------|----------|
| r=4 | Simple style/format changes |
| r=8 | Moderate task adaptation |
| r=16 | Complex task fine-tuning |
| r=32 | Major behavioral changes |
| r=64 | Near full fine-tuning territory |

Higher rank = more parameters = more capacity = slower training = more memory

Start with r=16, adjust based on results.
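To see how rank translates into trainable parameters, the sketch below counts LoRA parameters for a Llama 3 8B-style model. The projection shapes and layer count are assumptions about that architecture (check your model's config before relying on them), and the function name is hypothetical.

```python
# Assumed projection shapes (in_features, out_features) for Llama 3 8B:
# hidden size 4096, grouped-query attention with 1024-dim key/value outputs,
# feed-forward size 14336, 32 decoder layers.
LLAMA3_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(rank, target_modules, shapes=LLAMA3_8B_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (rank x in_features) and B (out_features x rank)."""
    per_layer = sum(rank * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * num_layers


all_linear = list(LLAMA3_8B_SHAPES)
for rank in (4, 8, 16, 32, 64):
    print(f"r={rank:<2} -> {lora_trainable_params(rank, all_linear):,} trainable params")
print(f"q_proj + v_proj only, r=16 -> {lora_trainable_params(16, ['q_proj', 'v_proj']):,}")
```

With these shapes the count scales linearly with rank, so doubling r doubles adapter size, optimizer state, and checkpoint size.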

Target Modules: Where to Apply LoRA

Not all layers benefit equally:

```python
# Common configurations:

# Attention-only (conservative, fast)
target_modules = ["q_proj", "v_proj"]

# Attention + output (common default)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# All linear layers (maximum coverage)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# Including embeddings (for multilingual/new vocabulary)
target_modules = ["embed_tokens", "q_proj", "k_proj", "v_proj",
                  "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"]
```

For most fine-tuning tasks: target all attention + feed-forward projections.

---

## LoRA Merging

After training, you can merge the LoRA adapters back into the base model:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "path/to/lora/adapter")

# Merge adapters into base model
merged_model = model.merge_and_unload()

# Save merged model (now it's a standalone model without needing the adapter separately)
merged_model.save_pretrained("./merged-model")
```

Benefits of merging:
- Single artifact to deploy (no separate adapter to load)
- No adapter overhead at inference time
- Can quantize the merged model

---

# 02 — QLoRA: Quantized LoRA

## Making LoRA Even More Accessible

LoRA cuts the number of trainable parameters by roughly 100x. QLoRA goes further and cuts the memory needed for the frozen base weights by about 4x by quantizing them from 16-bit to 4-bit.

**QLoRA = Quantize the base model to 4-bit + Apply LoRA adapters in 16-bit**

```
Full fine-tuning 70B:  ~1,400 GB VRAM (impossible on anything reasonable)
LoRA on 70B in fp16:   ~160 GB VRAM (need 2× A100 80GB minimum)
QLoRA on 70B in 4-bit: ~48 GB VRAM (1× A100 80GB!)
```

How QLoRA Works

1. **Quantize the base model to 4-bit (NF4)**
   - Frozen weights are stored in the 4-bit NormalFloat (NF4) data type instead of 16-bit floats
   - About 4x memory reduction for the base weights
2. **Apply LoRA adapters in bfloat16**
   - The small LoRA adapter matrices stay in 16-bit precision and are the only weights that are updated
   - Gradients flow back through the frozen 4-bit weights (dequantized on the fly) into the adapters
3. **Double quantization**
   - Also quantize the quantization constants
   - Saves roughly 0.4 bits per parameter (about 3 GB on a 65-70B model)
4. **Paged optimizers**
   - Optimizer states are paged out to CPU RAM when GPU memory spikes
   - Reduces the risk of OOM crashes
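The double-quantization saving can be worked out with simple arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (one constant per 64 weights, then 8-bit constants grouped in blocks of 256 with one fp32 second-level constant); the function names are illustrative.

```python
def constants_overhead_bits(block_size=64, const_bits=32):
    """Bits per parameter spent on one fp32 scaling constant per block."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, const_bits=8,
                               second_block=256, second_const_bits=32):
    """Overhead after quantizing the constants to 8-bit, with one fp32
    second-level constant per `second_block` first-level constants."""
    return const_bits / block_size + second_const_bits / (block_size * second_block)


plain = constants_overhead_bits()        # 32 / 64 = 0.5 bits per parameter
double = double_quant_overhead_bits()    # 8/64 + 32/(64*256), about 0.127 bits
saving_bits = plain - double

for n_params in (8e9, 70e9):
    saved_gb = saving_bits * n_params / 8 / 1e9
    print(f"{n_params / 1e9:.0f}B params: ~{saved_gb:.2f} GB saved by double quantization")
```

The saving is proportional to model size, so it matters most for the largest models where every gigabyte decides whether the job fits on one GPU.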

QLoRA in Practice (Using Unsloth — recommended)

# Unsloth simplifies QLoRA setup; the project reports 2-5x faster training
# pip install unsloth

from unsloth import FastLanguageModel
import torch

# Load model in 4-bit automatically
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=2048,
    dtype=None,      # Auto-detect best dtype
    load_in_4bit=True,  # QLoRA: load base in 4-bit
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Reduces memory further
    random_state=42,
)

# Memory: ~8-10 GB for 8B model on consumer GPU!

Hardware Requirements with QLoRA

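| Model | Without QLoRA (fp16 weights) | With QLoRA (4-bit weights) | Consumer Hardware |
|-------|------------------------------|----------------------------|-------------------|
| 7-8B | ~14 GB | ~4-5 GB | RTX 3060 12GB ✓ |
| 13B | ~26 GB | ~8 GB | RTX 3090 24GB ✓ |
| 34B | ~68 GB | ~20 GB | RTX 4090 24GB (barely) |
| 70B | ~140 GB | ~40 GB | 2× RTX 4090 |

These figures cover the model weights only. Training adds adapter weights, optimizer state, and activations, which is why the 8B example above needs ~8-10 GB in practice.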

QLoRA democratized LLM fine-tuning. You can fine-tune a state-of-the-art 7B model on a gaming GPU.

03 — DPO: Direct Preference Optimization

The Problem with RLHF

Traditional RLHF (coming next) requires training a separate reward model and using complex RL algorithms. This is:

Complicated to implement
Unstable (RL training can diverge)
Slow and memory-intensive

DPO (2023) targets the same goal with a simpler approach: skip the separate reward model and optimize directly on preference pairs.

How DPO Works

DPO directly trains the model to:

Increase the probability of “chosen” responses
Decrease the probability of “rejected” responses

from trl import DPOTrainer, DPOConfig

# Your preference dataset
# {"prompt": "...", "chosen": "...", "rejected": "..."}

dpo_config = DPOConfig(
    beta=0.1,        # Controls deviation from reference model
                     # Higher = stay closer to base model behavior
    output_dir="./dpo-output",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
)

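# Assumes `model`, `ref_model`, `tokenizer`, and `dataset` were created earlier.
trainer = DPOTrainer(
    model=model,           # The model to train
    ref_model=ref_model,   # Reference model (frozen copy of the starting model)
    tokenizer=tokenizer,   # Newer TRL releases name this argument processing_class
    train_dataset=dataset,
    args=dpo_config,
)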

trainer.train()

The Beta Parameter

Beta (β) controls how much the model can deviate from the original (reference) model.

```
β = 0.01: Very free to change, might drift far from original capabilities
β = 0.1:  Balanced (common default)
β = 0.5:  Conservative, stays close to base model
β = 1.0:  Very conservative
```

Low beta → stronger preference optimization, but might "forget" original capabilities.
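A minimal sketch of the per-pair DPO loss helps show where β enters. It assumes you already have the summed log-probability of each full response under the policy and under the frozen reference; the numbers below are made up for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed in a numerically stable way
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Made-up log-probabilities for illustration only
untrained = dpo_loss(-20.0, -22.0, -20.0, -22.0)   # policy equals reference
print(f"policy equals reference: loss={untrained:.4f}")
for beta in (0.01, 0.1, 0.5):
    shifted = dpo_loss(-15.0, -30.0, -20.0, -22.0, beta=beta)
    print(f"beta={beta}: loss={shifted:.4f}")
```

For the same shift away from the reference, a larger β produces a larger logit, so the loss saturates sooner and there is less pressure to keep drifting from the reference model.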

---

## DPO vs SFT: Use Both

Typical pipeline:
```
1. SFT on chosen responses → teaches the model WHAT good responses look like
2. DPO on preference pairs → teaches it WHICH of two responses is BETTER
```

DPO without SFT can be unstable. SFT without DPO lacks quality differentiation.

---

## DPO Variants

| Method | When to Use |
|--------|-------------|
| DPO | Standard preference optimization |
| IPO | When DPO overfits to preference data |
| KTO | When you only have good/bad labels, not pairs |
| ORPO | Combined SFT + DPO in one pass (efficient) |
| SimPO | Simplified, no reference model needed |

If you want a single training pass, ORPO (combined SFT + preference optimization) is a simple and competitive starting point; otherwise use the SFT-then-DPO pipeline above.

---

# 04 — RLHF: Reinforcement Learning from Human Feedback

## The Original Alignment Technique

RLHF is how ChatGPT was trained to be helpful and harmless. It's more complex than DPO but remains important for understanding the field.

---

## RLHF in Three Stages

### Stage 1: SFT (Supervised Fine-Tuning)
Train the model on instruction-response pairs.
Same as what we covered in Module 02.

### Stage 2: Reward Model Training
Train a separate model to score responses:

```
Prompt: "Explain quantum computing"
Response A: [clear, accurate explanation] → Reward: 8.5
Response B: [confusing, slightly wrong]   → Reward: 4.2
Response C: [excellent, with examples]   → Reward: 9.1
```

The reward model learns human preferences from pairwise comparisons:
```json
{"prompt": "...", "chosen": "response A", "rejected": "response B"}
```

Stage 3: RL Training (PPO)

Use the reward model to improve the policy (language model):

1. Generate a response from the current policy (initialized from the SFT model)
2. Score it with the reward model
3. Use PPO (Proximal Policy Optimization) to adjust the model
   toward responses the reward model would score higher
4. Also penalize diverging too far from the SFT model (KL penalty)
5. Repeat over many batches of prompts
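Two small formulas sit underneath stages 2 and 3. The sketch below shows a pairwise reward-model loss and a KL-penalized reward, assuming sequence-level log-probabilities; `kl_coef` and both function names are hypothetical, and real PPO implementations add much more (value model, clipping, per-token penalties).

```python
import math


def reward_model_pair_loss(reward_chosen, reward_rejected):
    """Pairwise loss for reward-model training:
    -log(sigmoid(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_penalized_reward(reward, policy_logp, ref_logp, kl_coef=0.1):
    """Reward handed to the RL step: reward-model score minus a penalty
    for drifting away from the SFT/reference model on this response."""
    return reward - kl_coef * (policy_logp - ref_logp)


# Reward scores reuse the illustrative values above; log-probabilities are made up
print(reward_model_pair_loss(8.5, 4.2))   # small loss: ranking already correct
print(reward_model_pair_loss(4.2, 8.5))   # large loss: ranking is wrong
print(kl_penalized_reward(9.1, policy_logp=-40.0, ref_logp=-55.0))
```

The KL term is what keeps the policy from chasing reward-model quirks: a response the policy finds far more likely than the reference does gets its reward discounted.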

Why RLHF is Powerful

RLHF can teach things that are hard to express in supervised examples:

“Don’t be sycophantic (don’t just agree to please)”
“Be helpful but honest”
“Prefer concise answers unless depth is needed”

These nuanced preferences emerge from the reward model’s learning.

Why DPO Often Beats RLHF in Practice

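| Factor | RLHF | DPO |
|--------|------|-----|
| Complexity | Very high | Moderate |
| Stability | Can diverge | Generally stable |
| Memory | Policy + reference + reward model (plus a value model for PPO) | Policy + frozen reference model |
| Speed | Slow | Often 2-3x faster |
| Results | Excellent | Competitive |

For most practitioners: start with DPO. Reserve RLHF for large-scale production systems.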

05 — Quantization

What is Quantization?

Quantization = storing model parameters in lower precision (fewer bits per number).

Analogy: If weights are like measurements, quantization is like rounding from 4 decimal places to 1 decimal place.

```
Full precision: 0.23847183 (32 bits)
Half precision: 0.2385     (16 bits)
8-bit integer:  24         (8 bits, scaled)
4-bit integer:  6          (4 bits, scaled further)
```

Information is lost, but often surprisingly little.
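The rounding analogy can be made concrete with a simple absmax quantizer. This is a deliberately simplified sketch (one scale per block, uniform integer levels); it is not NF4 or the GGUF K-quant scheme, and the example weights are made up.

```python
def quantize_absmax(values, bits):
    """Symmetric absmax quantization: map floats to signed integers
    in [-qmax, qmax] using a single scale for the whole block."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


weights = [0.23847183, -0.91, 0.5, 0.031, -0.4]   # made-up example block
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: ints={q} scale={scale:.4f} max error={max_err:.4f}")
```

The worst-case error per weight is half the scale, so fewer bits means a coarser grid and larger rounding error, which is the trade-off the rest of this section quantifies.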

---

## Precision Types Compared

| Format | Bits | Range | Memory for 7B | Quality |
|--------|------|-------|--------------|---------|
| fp32 | 32 | ±3.4×10^38 | ~28 GB | Baseline |
| bf16 | 16 | ±3.4×10^38 | ~14 GB | ≈fp32 |
| fp16 | 16 | ±65,504 | ~14 GB | ≈fp32 |
| int8 | 8 | -128 to 127 | ~7 GB | ~99% of fp16 |
| int4 | 4 | -8 to 7 | ~3.5 GB | ~95-98% of fp16 |
| int2 | 2 | -2 to 1 | ~1.75 GB | ~80-90% of fp16 |

For most use cases, **Q4 or Q5** quantization is the sweet spot: roughly 3-4x smaller than fp16, with small quality loss on most tasks.
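The memory column in the table follows directly from parameters × bits ÷ 8. A short sketch, assuming a 7-billion-parameter model and counting weights only (no quantization constants, KV cache, or activations):

```python
def weight_memory_gb(n_params, bits):
    """Memory for the weights alone: parameters x bits / 8 bytes."""
    return n_params * bits / 8 / 1e9


for name, bits in [("fp32", 32), ("bf16/fp16", 16), ("int8", 8), ("int4", 4), ("int2", 2)]:
    print(f"{name:<10} {weight_memory_gb(7e9, bits):>6.2f} GB")
```

Real quantized files come out somewhat larger than this lower bound because they also store per-block scaling constants and often keep a few tensors at higher precision.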

---

## Types of Quantization

### Post-Training Quantization (PTQ) — Most Common
After training, convert the weights to lower precision.
No additional training needed.

```python
# Using bitsandbytes for 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # QLoRA's double quant
    bnb_4bit_quant_type="nf4",        # NormalFloat4 (best for weights)
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quantization_config,
    device_map="auto"
)
```

Quantization-Aware Training (QAT)

Train the model with quantization in mind. Better quality, more expensive.

GGUF Quantization (for llama.cpp / Ollama)

Specific quantization format for CPU/consumer hardware inference. Covered in section 08.

Common Quantization Levels in GGUF

When you download models from Hugging Face for Ollama:

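| Level | Quality | Approx. size (7-8B model) |
|-------|---------|---------------------------|
| Q2_K | Poor | ~2.8 GB |
| Q3_K_M | Low-Medium | ~3.6 GB |
| Q4_K_M | Good | ~4.5 GB |
| Q5_K_M | Very Good | ~5.7 GB |
| Q6_K | Excellent | ~6.7 GB |
| Q8_0 | Near-perfect | ~9.0 GB |
| F16 | Perfect | ~14 GB |

Exact file sizes vary with the model architecture and vocabulary size.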

Recommendation: Q4_K_M for low memory, Q5_K_M or Q6_K if you have room.

06 — Model Checkpoints

What is a Checkpoint?

During training, the model is saved periodically. Each saved version is called a checkpoint.

Why checkpoints matter:

Recovery: If training crashes, resume from last checkpoint
Selection: Training might peak at epoch 2, not epoch 5. Pick the best checkpoint.
Comparison: Compare different checkpoints to find optimal training length
Sharing: Save a checkpoint to share or deploy
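The "selection" point is easy to automate once you log an evaluation metric per checkpoint. A minimal sketch, using a hypothetical evaluation log with made-up loss values:

```python
# Hypothetical evaluation log: one entry per saved checkpoint
eval_log = [
    {"step": 200, "eval_loss": 1.42},
    {"step": 400, "eval_loss": 1.18},
    {"step": 600, "eval_loss": 1.09},
    {"step": 800, "eval_loss": 1.15},    # eval loss starts rising: likely overfitting
    {"step": 1000, "eval_loss": 1.27},
]


def best_checkpoint(log, metric="eval_loss", greater_is_better=False):
    """Return the log entry with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda entry: entry[metric])


best = best_checkpoint(eval_log)
print(f"best checkpoint: checkpoint-{best['step']} (eval_loss={best['eval_loss']})")
```

Here the last checkpoint is not the best one, which is the reason to keep several checkpoints instead of only the final state.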

Checkpoint Strategy

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    
    # Save every N steps
    save_steps=200,
    
    # Keep only the last N checkpoints (saves disk space)
    save_total_limit=3,
    
    # Save the best model based on eval loss
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    
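    # Evaluate every N steps (keep this aligned with save_steps so the best checkpoint can be reloaded)
    eval_steps=200,
    evaluation_strategy="steps",  # Renamed to eval_strategy in newer transformers releases
)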

What’s Inside a Checkpoint?

```
checkpoint-1000/
├── config.json              # Model architecture
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json
├── adapter_model.safetensors  # LoRA adapter weights (if using LoRA)
├── adapter_config.json      # LoRA configuration
├── optimizer.pt             # Optimizer state (for resuming training)
├── scheduler.pt             # Learning rate scheduler state
└── trainer_state.json       # Training metrics and state
```

SafeTensors format (.safetensors) is preferred over .pt or .bin: it loads quickly and, unlike pickle-based files, cannot execute arbitrary code when loaded.

---

## Resuming from Checkpoint

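```python
from trl import SFTTrainer

# Assumes `model`, `training_args`, and `train_dataset` are defined as in the earlier steps.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Resume from specific checkpoint
trainer.train(resume_from_checkpoint="./checkpoints/checkpoint-1000")
```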

07 — Adapter Tuning

The Adapter Ecosystem

“Adapters” is the general term for modular fine-tuning techniques. LoRA is the most popular, but there are others:

Prefix Tuning

Prepend learnable “prefix” vectors to the attention keys and values at every layer. The base model stays frozen and learns to condition on these.

from peft import PrefixTuningConfig

config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,  # 20 learned prefix tokens
)

Prompt Tuning

Even simpler: only learn the embeddings of a few tokens prepended to every input. Very parameter-efficient, but typically lower quality than LoRA.

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)

Multiply (not add) small learned vectors into attention and feed-forward layers. Even fewer parameters than LoRA, but less powerful.

Adapter Layers (Classic)

Add small bottleneck networks between transformer layers. Less popular now that LoRA exists.

Adapter Comparison

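| Method | Params | Quality | Memory | Speed |
|--------|--------|---------|--------|-------|
| Full fine-tune | 100% | ★★★★★ | Very High | Slow |
| LoRA | ~1% | ★★★★ | Low | Fast |
| QLoRA | ~1% | ★★★★ | Very Low | Fast |
| IA3 | ~0.01% | ★★★ | Lowest | Fastest |
| Prefix Tuning | ~0.1% | ★★★ | Low | Fast |
| Prompt Tuning | ~0.001% | ★★ | Minimal | Fastest |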

For most practitioners: LoRA/QLoRA is the right choice. Start there.
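The parameter percentages in the table can be sanity-checked with rough counts. The sketch below assumes Llama 3 8B-style dimensions and a common IA3 setup (one scaling vector each for keys, values, and the feed-forward inner activation per layer); treat the dimensions and function names as illustrative assumptions.

```python
# Assumed Llama 3 8B-style dimensions; BASE_PARAMS is the total quoted earlier in this lesson
HIDDEN, KV_DIM, FFN, LAYERS = 4096, 1024, 14336, 32
BASE_PARAMS = 8_030_261_248


def lora_params(rank):
    """LoRA on all seven linear projections: rank x (in + out) per projection."""
    per_layer = rank * (
        2 * (HIDDEN + HIDDEN)      # q_proj, o_proj
        + 2 * (HIDDEN + KV_DIM)    # k_proj, v_proj
        + 3 * (HIDDEN + FFN)       # gate_proj, up_proj, down_proj
    )
    return per_layer * LAYERS


def ia3_params():
    """IA3: one learned scaling vector for keys, values, and the FFN inner activation."""
    return (KV_DIM + KV_DIM + FFN) * LAYERS


def prompt_tuning_params(num_virtual_tokens=20):
    """Prompt tuning: one learned embedding per virtual token."""
    return num_virtual_tokens * HIDDEN


for name, count in [("LoRA r=16", lora_params(16)),
                    ("IA3", ia3_params()),
                    ("Prompt tuning (20 tokens)", prompt_tuning_params())]:
    print(f"{name:<26} {count:>12,} params  ({100 * count / BASE_PARAMS:.4f}% of base)")
```

The orders of magnitude line up with the table: LoRA sits near the 1% range, IA3 near 0.01%, and prompt tuning near 0.001%.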

Mixing Multiple Adapters

You can load and switch adapters dynamically:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Load multiple LoRA adapters (adapter paths are placeholders)
model = PeftModel.from_pretrained(base_model, "lora-customer-service", adapter_name="customer")
model.load_adapter("lora-compliance", adapter_name="compliance")
model.load_adapter("lora-coding", adapter_name="coding")

# Switch between tasks
model.set_adapter("customer")    # Now behaves like customer service model
response1 = model.generate(...)

model.set_adapter("compliance")  # Now behaves like compliance model
response2 = model.generate(...)
```

This is powerful for multi-task systems without needing multiple full models.

---

# 08 — GGUF Models

## What is GGUF?

GGUF (commonly expanded as GPT-Generated Unified Format) is a file format for storing models, usually quantized, for inference with **llama.cpp** on CPUs and consumer GPUs.

It replaced the older GGML format in 2023.

When you download a model from Ollama or run it locally on your Mac, you're likely using GGUF.

---

## Why GGUF Matters

1. **CPU inference**: GGUF models can run on CPU (slowly) — no GPU needed
2. **Apple Silicon**: Excellent support for Mac M1/M2/M3 via Metal GPU
3. **Quantized**: Already quantized to various levels (Q4, Q5, Q8...)
4. **Single file**: Everything in one .gguf file — easy to download and use
5. **Ollama/LM Studio**: These tools use GGUF under the hood

---

## Converting to GGUF

After fine-tuning, you might want to convert your model to GGUF for local inference:

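```bash
# Build llama.cpp (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt   # Python dependencies for the conversion script
cmake -B build
cmake --build build --config Release

# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py \
    /path/to/your/merged-model \
    --outfile my-model.gguf \
    --outtype f16

# Quantize the GGUF to Q4_K_M
./build/bin/llama-quantize my-model.gguf my-model-Q4_K_M.gguf Q4_K_M
```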

Loading GGUF Models

# Using llama-cpp-python
# pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="./my-model-Q4_K_M.gguf",
    n_ctx=4096,         # Context window
    n_gpu_layers=-1,    # Use all GPU layers (if GPU available)
    n_threads=8,        # CPU threads
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is compliance automation?"}
    ],
    max_tokens=512,
    temperature=0.7
)

print(response['choices'][0]['message']['content'])

📝 Module 03 Summary

| Concept | Key Takeaway |
|---------|--------------|
| LoRA | Train roughly 0.5-1% of parameters using low-rank matrices. Often close to full fine-tuning quality at a fraction of the memory. |
| QLoRA | Quantize base model + LoRA adapters. Fine-tune 8B on a gaming GPU. |
| DPO | Simpler RLHF alternative. Trains on chosen/rejected pairs directly. |
| RLHF | Original alignment technique. Powerful, complex, requires reward model. |
| Quantization | Reduce precision (32 or 16 bit down to 4 bit) for 4-8x size reduction, typically with a few percent quality loss. |
| Checkpoints | Save training state periodically. Pick the best one. |
| Adapters | Modular fine-tuning approach. LoRA is the dominant technique. |
| GGUF | Quantized model format for local CPU/GPU inference. Used by Ollama. |

🧠 Mental Model

Base Model (massive, general knowledge)
    ↓ [4-bit quantization = load onto consumer GPU]
Quantized Base Model (same knowledge, smaller)
    ↓ [LoRA = train tiny adapter matrices]
Fine-tuned Adapter (specialized for your task)
    ↓ [merge or keep separate]
Deployable Model
    ↓ [convert to GGUF for local use]
Local Model (runs on your laptop)

❌ Beginner Mistakes

Full fine-tuning on consumer hardware — Use QLoRA. Always.
Setting rank too high — Start with r=16. Go higher only if quality is lacking.
Training too many epochs — 1-3 epochs is usually optimal for SFT
Skipping validation — Watch your eval loss, not just train loss
Wrong target modules — Check the model architecture, not all modules are named the same
Forgetting to merge before GGUF conversion — The base model + adapter must be merged first

🏋️ Module Exercise

Fine-tune a small model with QLoRA (on Google Colab — free GPU):

Enterprise Lab Evidence

Submit these artifacts with the lab:

environment validation: GPU type, CUDA/Colab runtime, package versions
data card for the training and test examples
base-model baseline answers before fine-tuning
training log with loss curve or step output
tuned-model eval results on a locked test set
failure analysis with at least 3 regressions or weak answers
rollback note explaining how to return to the base model or previous adapter

Pass/fail gate:

| Requirement | Pass standard |
|-------------|---------------|
| Environment | Runtime can load model, train, and generate without manual hidden steps |
| Baseline | Base model output is captured before training |
| Evaluation | Tuned model is compared against baseline on held-out examples |
| Regression check | General capability and refusal behavior are spot-checked |
| Reproducibility | Dataset version, model version, hyperparameters, and seed are recorded |
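The Evaluation and Regression check rows can be backed by a small comparison script. This is a minimal sketch with a hypothetical keyword-based scorer and made-up answers; a real lab submission would use your locked test set and a stronger metric.

```python
def keyword_score(answer, required_keywords):
    """Fraction of required keywords present in the answer (case-insensitive)."""
    text = answer.lower()
    return sum(kw.lower() in text for kw in required_keywords) / len(required_keywords)


def compare_models(test_set, baseline_answers, tuned_answers):
    """Score both models on the same held-out examples and flag regressions."""
    rows = []
    for example, base, tuned in zip(test_set, baseline_answers, tuned_answers):
        b = keyword_score(base, example["keywords"])
        t = keyword_score(tuned, example["keywords"])
        rows.append({"id": example["id"], "baseline": b, "tuned": t, "regressed": t < b})
    return rows


# Hypothetical held-out examples and model outputs
test_set = [
    {"id": "gdpr-1", "keywords": ["EU", "personal data"]},
    {"id": "psd2-1", "keywords": ["payment", "authentication"]},
]
baseline_answers = ["GDPR is a privacy law.",
                    "PSD2 covers payment services and strong authentication."]
tuned_answers = ["GDPR is an EU law governing personal data.",
                 "PSD2 is a payment directive."]

report = compare_models(test_set, baseline_answers, tuned_answers)
for row in report:
    print(row)
print("regressions:", [row["id"] for row in report if row["regressed"]])
```

Substring matching is crude (short keywords can match inside unrelated words), so treat this as a starting point for the failure analysis, not as the final metric.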

```python
# Full working example in Google Colab (T4 GPU, free tier)
# Runtime: ~30 minutes for 1 epoch on a tiny dataset

# Step 1: Install
!pip install unsloth trl datasets -q

# Step 2: Load model with QLoRA
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/llama-3-8b-Instruct-bnb-4bit",  # Pre-quantized
    max_seq_length=1024,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Step 3: Prepare dataset (tiny example)
from datasets import Dataset

raw_data = [
    {"instruction": "What is GDPR?", 
     "output": "GDPR (General Data Protection Regulation) is an EU law that governs how organizations collect, store, and process personal data of EU citizens."},
    {"instruction": "What is PSD2?",
     "output": "PSD2 (Payment Services Directive 2) is an EU regulation requiring banks to open their APIs to third-party payment providers and implement Strong Customer Authentication."},
    # Add 50+ more examples for real training
]

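def format_example(example):
    # Use the tokenizer's own chat template so the training text matches
    # the format produced by apply_chat_template at inference time (Step 5).
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = Dataset.from_list(raw_data).map(format_example)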

# Step 4: Train
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=1024,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        output_dir="./compliance-lora",
        logging_steps=10,
    )
)

trainer.train()

# Step 5: Test
FastLanguageModel.for_inference(model)  # Switch Unsloth to inference mode

messages = [{"role": "user", "content": "What is GDPR?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # Append the assistant header so the model answers
    return_tensors="pt",
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Goal:** Get this running. Even with 5 examples, you'll see the model respond in a different style. Add more examples and see quality improve.

---

*Move to [Module 04 — Inference & Optimization](/tutorials/llm-mastery/intermediate/03-inference-optimization-serving)*