The Core Distinction
Every AI feature decision comes down to a fundamental question: Do you need the model to know more, or to behave differently?
- Know more → RAG (inject knowledge at query time) or fine-tuning on knowledge (rare and usually wrong)
- Behave differently → Prompting (first) or fine-tuning (when prompting has been exhausted)
Teams waste months and thousands of dollars fine-tuning when a few hours of prompt engineering would solve the problem. This framework prevents that.
The Three Approaches
Prompting
What it does: Shapes the model’s behavior by giving it explicit instructions in the system prompt and few-shot examples.
Cost: Near-zero. Writing prompts takes hours to days. No infrastructure changes.
What it solves: Tone, format, persona, reasoning style, output structure, task framing.
Limitations: Cannot teach the model new facts. Cannot reliably override deep training. Has a ceiling for complex behaviors.
When to use it first: Always. Before any other approach.
RAG
What it does: Retrieves relevant documents at query time and injects them into the prompt as context.
Cost: Moderate. You need a vector database, an embedding pipeline, and maintenance of the document corpus. $100-$2,000/month depending on scale.
What it solves: Knowledge that changes over time. Private or proprietary information. Large corpora that can’t fit in a single prompt. Questions that require specific facts.
Limitations: Retrieval quality determines answer quality. Cannot change the model’s reasoning style or output format. Adds latency.
When to use it: When the model needs access to information it wasn’t trained on, or information that changes.
Fine-tuning
What it does: Creates a new model checkpoint by training the base model on your dataset of (prompt, ideal_response) pairs.
Cost: High. Training costs $500-$5,000+ depending on model size and dataset. Then there’s inference cost (fine-tuned models often cost more per token than base models), evaluation infrastructure, deployment pipeline, and ongoing maintenance.
What it solves: Consistent style and format adherence that prompting can’t reliably achieve. Specific behaviors deeply embedded in the model. Latency reduction (compressed few-shot examples into weights).
Limitations: Dataset quality is everything - garbage training data produces a garbage model. Fine-tuned models go stale when the world changes. Requires its own eval suite and deployment pipeline. Cannot easily update for new knowledge.
When to use it: When you have 100+ high-quality (prompt, response) examples, have exhausted prompting and RAG, and need consistent behavior that prompting cannot achieve.
Decision Framework
Fine-tuning vs RAG vs Prompting Decision Flowchart
flowchart TD
START([New AI Feature]) --> Q1{Does the model need
access to your private
or changing knowledge?}
Q1 -->|Yes| Q2{Is the knowledge corpus
larger than fits in
a single prompt?}
Q1 -->|No| Q3{Does the current model
behavior fall short
with good prompting?}
Q2 -->|Yes| RAG([Use RAG
Vector DB retrieval])
Q2 -->|No| Q4{Does the knowledge
change frequently?}
Q4 -->|Yes| RAG
Q4 -->|No| Q5{Do you have 100+
quality examples?}
Q3 -->|No| PROMPT([Improve Your Prompt
Few-shot examples
Clear instructions])
Q3 -->|Yes| Q6{Is the issue about
knowledge or behavior?}
Q6 -->|Knowledge| RAG
Q6 -->|Behavior| Q7{Have you exhausted
prompt engineering
including few-shot?}
Q7 -->|No| PROMPT
Q7 -->|Yes| Q5
Q5 -->|No| PROMPT
Q5 -->|Yes| Q8{Do you have budget
for ongoing maintenance?}
Q8 -->|No| PROMPT
Q8 -->|Yes| FT([Fine-tune
Train custom model])
style START fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
style PROMPT fill:#dcfce7,stroke:#16a34a,color:#15803d
style RAG fill:#fef3c7,stroke:#d97706,color:#92400e
style FT fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
flowchart TD
START([New AI Feature]) --> Q1{Does the model need
access to your private
or changing knowledge?}
Q1 -->|Yes| Q2{Is the knowledge corpus
larger than fits in
a single prompt?}
Q1 -->|No| Q3{Does the current model
behavior fall short
with good prompting?}
Q2 -->|Yes| RAG([Use RAG
Vector DB retrieval])
Q2 -->|No| Q4{Does the knowledge
change frequently?}
Q4 -->|Yes| RAG
Q4 -->|No| Q5{Do you have 100+
quality examples?}
Q3 -->|No| PROMPT([Improve Your Prompt
Few-shot examples
Clear instructions])
Q3 -->|Yes| Q6{Is the issue about
knowledge or behavior?}
Q6 -->|Knowledge| RAG
Q6 -->|Behavior| Q7{Have you exhausted
prompt engineering
including few-shot?}
Q7 -->|No| PROMPT
Q7 -->|Yes| Q5
Q5 -->|No| PROMPT
Q5 -->|Yes| Q8{Do you have budget
for ongoing maintenance?}
Q8 -->|No| PROMPT
Q8 -->|Yes| FT([Fine-tune
Train custom model])
style START fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
style PROMPT fill:#dcfce7,stroke:#16a34a,color:#15803d
style RAG fill:#fef3c7,stroke:#d97706,color:#92400e
style FT fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
Cost vs Complexity Matrix
| Approach | Time to Ship | One-time Cost | Monthly OpEx | Knowledge Updates | Behavior Changes |
|---|---|---|---|---|---|
| Prompting | Days | $0 | $0 | Instant | Easy |
| RAG | Weeks | $1K-$5K | $100-$2K | Incremental | Limited |
| Fine-tuning | Months | $5K-$50K | $2K-$20K | New training run | Excellent |
The numbers are representative for a mid-size enterprise application. Fine-tuning costs 10-100× more than RAG in setup, and RAG costs 10-100× more than prompting.
Common Mistakes
Mistake 1: Fine-tuning for knowledge problems. A legal team wants the AI to know their internal case law database. They train a fine-tuned model on 10,000 legal documents. Three months later, new rulings are issued, and the model’s knowledge is stale. They needed RAG, not fine-tuning. Knowledge belongs in a retrieval system that can be updated cheaply.
Mistake 2: RAG for style problems. A company wants all AI output to follow their specific communication style guide - short sentences, no passive voice, specific terminology. They build a RAG system that retrieves style guide excerpts. The style guide appears in every prompt but the model ignores it inconsistently. They needed fine-tuning (or at minimum, aggressive few-shot prompting). Style is a behavior, not knowledge.
Mistake 3: Skipping prompting. An engineering team immediately proposes fine-tuning because “the base model doesn’t do what we want.” Two sprints later, they have a dataset and are building infrastructure. A product manager asks: “Did you try putting that requirement in the system prompt?” They had not. They needed 30 minutes of prompt engineering, not a 6-week fine-tuning project.
Before committing to fine-tuning, run this test: write a system prompt with 5 high-quality few-shot examples of the behavior you want. If the model produces correct output 80%+ of the time on your test cases, prompting is sufficient. Only if you cannot reach acceptable quality with excellent few-shot examples should you consider fine-tuning.
The 6 Key Questions
Before any AI feature decision, answer these six questions:
-
Is the required information static or dynamic? Static (company values, procedures that rarely change) → prompting or fine-tuning. Dynamic (news, documents, databases that update) → RAG.
-
Is the problem about knowing or behaving? Knowing → RAG. Behaving → prompting first, fine-tuning last.
-
What is the budget? Under $1K → prompting only. $1K-$20K → RAG if needed. Over $20K → fine-tuning is on the table.
-
What is the latency requirement? Sub-100ms → prompting only (no retrieval). Under 500ms → RAG is viable. Fine-tuned models are fastest per-token but retrieval adds latency.
-
Do you have labeled training data? No → you cannot fine-tune yet. Creating training data is a project itself. Yes, 100+ examples → fine-tuning is possible. Yes, 1000+ examples → fine-tuning will likely work.
-
Who maintains the model? In-house ML team → fine-tuning is feasible. No ML team → prompting and RAG only.
Your domain knowledge is the secret ingredient in this decision. Engineers can build any of the three pipelines - but only you know what “good enough” looks like for the business. When evaluating options, translate the matrix into business terms: RAG means the AI always works with current data but costs more to maintain; fine-tuning means consistent behavior but becomes stale. Present it to stakeholders as: “Do we need the AI to know more, or behave differently? RAG for knowing more, fine-tuning for behaving differently.”
Fine-tuning takes weeks and costs $K+. Prompt engineering takes days and costs $0. Exhaust prompting before considering fine-tuning - this is not a technical preference, it is a product velocity decision. When the team proposes fine-tuning, ask: “What happens if we put the best possible instructions and 5 examples in the system prompt? Have we tried that?” Set a policy: fine-tuning requires PM sign-off after prompting has been tried and documented.
Never fine-tune on your first version of a feature. Ship with prompting, measure with evals, gather real user data, and use that data to improve your prompts. If you have 3 months of production traffic showing where the model falls short, you have the foundation for a fine-tuning dataset. Fine-tuning on synthetic or theoretical examples usually underperforms on real traffic.
Quick Reference: The One-Line Rules
- Use prompting when: you haven’t tried it yet
- Use RAG when: the model needs to know your data
- Use fine-tuning when: you have exhausted prompting, have 100+ labeled examples, have a budget, and have a maintenance plan
Decision Framework: Score Your Feature Against Each Approach
Example code (static). Copy and run locally in your own environment.
from dataclasses import dataclass, field
@dataclass
class FeatureRequirements:
# Knowledge characteristics
knowledge_is_private: bool = False
knowledge_changes_frequently: bool = False
corpus_too_large_for_prompt: bool = False
# Behavior characteristics
requires_style_consistency: bool = False
prompting_quality_acceptable: bool = True # set False if base prompting fails
# Resources
budget_usd: int = 0
labeled_examples_available: int = 0
has_ml_team: bool = False
latency_budget_ms: int = 2000
# Risk tolerance
acceptable_maintenance_overhead: bool = True
@dataclass
class Recommendation:
approach: str
confidence: str # high | medium | low
reasons: list[str] = field(default_factory=list)
warnings: list[str] = field(default_factory=list)
def recommend(req: FeatureRequirements) -> Recommendation:
reasons = []
warnings = []
# ── Rule: Always start with prompting evaluation ──────────────────────────
if req.prompting_quality_acceptable:
return Recommendation(
approach="Prompting",
confidence="high",
reasons=["Base model + good prompt meets quality requirements"],
warnings=["Re-evaluate if quality degrades at scale"],
)
# ── Rule: RAG for knowledge problems ─────────────────────────────────────
needs_rag = (
req.knowledge_is_private
or req.knowledge_changes_frequently
or req.corpus_too_large_for_prompt
)
if needs_rag:
reasons.append("Knowledge is private, dynamic, or too large for context")
if req.latency_budget_ms < 300:
warnings.append("RAG adds 100-300ms latency - may not meet latency budget")
if req.budget_usd < 1000:
warnings.append("RAG infra costs ~$100-500/month minimum")
return Recommendation(
approach="RAG",
confidence="high",
reasons=reasons,
warnings=warnings,
)
# ── Rule: Fine-tuning only if conditions are met ──────────────────────────
if req.requires_style_consistency:
reasons.append("Consistent style/behavior required that prompting cannot achieve")
if req.labeled_examples_available < 100:
warnings.append("Only " + str(req.labeled_examples_available) + " labeled examples - need 100+ for fine-tuning")
return Recommendation(
approach="Prompting (collect more data first)",
confidence="medium",
reasons=reasons,
warnings=warnings,
)
if req.budget_usd < 5000:
warnings.append("Budget $" + str(req.budget_usd) + " may be insufficient for fine-tuning + maintenance")
return Recommendation(
approach="Prompting (insufficient fine-tuning budget)",
confidence="medium",
reasons=reasons,
warnings=warnings,
)
if not req.has_ml_team:
warnings.append("No ML team - fine-tuning without ML expertise has high failure rate")
if not req.acceptable_maintenance_overhead:
warnings.append("Fine-tuned models require ongoing eval and retraining - significant overhead")
confidence = "high" if req.has_ml_team and not warnings else "medium"
return Recommendation(
approach="Fine-tuning",
confidence=confidence,
reasons=reasons,
warnings=warnings,
)
# ── Demo: Score three different features ──────────────────────────────────────
scenarios = [
(
"Customer FAQ chatbot (internal docs)",
FeatureRequirements(
knowledge_is_private=True,
knowledge_changes_frequently=True,
prompting_quality_acceptable=False,
budget_usd=3000,
),
),
(
"Brand-voice content generator",
FeatureRequirements(
requires_style_consistency=True,
prompting_quality_acceptable=False,
labeled_examples_available=500,
budget_usd=15000,
has_ml_team=True,
),
),
(
"Summarize support tickets",
FeatureRequirements(
prompting_quality_acceptable=True,
),
),
]
for name, req in scenarios:
rec = recommend(req)
print(f"Feature: {name}")
print(f" Recommendation: {rec.approach} (confidence: {rec.confidence})")
for r in rec.reasons:
print(f" + {r}")
for w in rec.warnings:
print(f" ! {w}")
print()
Fine-tuned models need their own eval suites and deployment pipelines. The operational cost of fine-tuning often exceeds the training cost. Budget for maintenance, not just creation. You need: a labeled eval dataset (ongoing curation), a retraining pipeline (for when the model drifts), a deployment pipeline separate from your base model, and a rollback procedure if the fine-tuned model regresses. Teams routinely budget $10K for training and discover the first year of operations costs $40K.
Interview Notes: SFT, RLHF, Constitutional AI, LoRA, QLoRA, and DPO
Supervised fine-tuning (SFT) teaches examples of desired behavior. RLHF optimizes against human preference rewards. Constitutional AI uses written principles and critique/revision to reduce reliance on direct human labels. LoRA and QLoRA adapt models efficiently with low-rank adapters; QLoRA quantizes the base model to reduce GPU memory. DPO trains directly from preference pairs without a separate reward model.
Use fine-tuning for behavior, style, format, and domain patterns. Use RAG for changing/private knowledge. Do not fine-tune secrets into a model.
Interview Practice
- When is prompting enough?
- When should you choose RAG over fine-tuning?
- What behavior is fine-tuning good at changing?
- Compare SFT, RLHF, Constitutional AI, DPO, LoRA, and QLoRA.
- Why should secrets not be fine-tuned into a model?
- How would you decide using cost, latency, privacy, and quality?