The Cost Case for Multi-Model
Using GPT-4 for every request is like hiring a senior architect to answer “what’s 2+2?” - expensive and unnecessary.
A typical application's query mix, and the model tier each share actually needs:
- 70% of queries: simple classification, short extraction, yes/no decisions → cheap model
- 20% of queries: moderate complexity, multi-step reasoning → mid-tier model
- 10% of queries: complex reasoning, nuanced judgment, long-form → premium model
Routing intelligently can cut AI costs 60-80% with minimal quality impact.
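The arithmetic behind that claim can be sketched as follows. This is a rough model, not a benchmark: the per-token prices are the illustrative figures from this article's routing diagram, the tier names are made up for the example, and by default it assumes every query uses the same number of tokens, which real traffic rarely does.

```python
# Illustrative list prices in USD per 1M tokens, taken from this article's routing diagram.
PRICE_PER_1M = {"cheap": 0.15, "mid": 2.50, "premium": 15.00}
# Share of queries per tier, from the breakdown above.
QUERY_MIX = {"cheap": 0.70, "mid": 0.20, "premium": 0.10}


def blended_price(mix, prices, avg_tokens=None):
    """Token-weighted average price per 1M tokens across all tiers."""
    if avg_tokens is None:
        # Assumption: every query uses the same number of tokens.
        avg_tokens = {tier: 1.0 for tier in mix}
    total_tokens = sum(mix[t] * avg_tokens[t] for t in mix)
    total_cost = sum(mix[t] * avg_tokens[t] * prices[t] for t in mix)
    return total_cost / total_tokens


def savings_vs_all_premium(mix, prices, avg_tokens=None):
    """Fraction saved compared with sending every query to the premium tier."""
    baseline = prices["premium"]
    return (baseline - blended_price(mix, prices, avg_tokens)) / baseline


blended = blended_price(QUERY_MIX, PRICE_PER_1M)
savings = savings_vs_all_premium(QUERY_MIX, PRICE_PER_1M)
print(f"Blended price: ${blended:.3f} per 1M tokens")
print(f"Savings vs all-premium: {savings:.0%}")
```

If premium queries are longer than simple ones, pass an `avg_tokens` mapping (for example, average tokens per query in each tier) and the savings figure drops accordingly.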
Three Routing Strategies
1. Complexity-based routing - classify the query before sending it
2. Task-based routing - route by task type (summarization → cheap, reasoning → premium)
3. Cascade routing - try cheap model first, escalate if confidence is low
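Cascade routing is the least obvious of the three, so here is a minimal sketch of it. The `ModelAnswer` type, the `cascade` function, and both stub models are hypothetical; in particular, how a real model reports "confidence" is an assumption you would need to replace with your own signal (log-probabilities, a verifier model, a schema check, and so on).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ModelAnswer:
    text: str
    confidence: float  # 0.0-1.0; how this is estimated is left to the caller


ModelFn = Callable[[str], ModelAnswer]


def cascade(query: str, tiers: Sequence[tuple[str, ModelFn]], min_confidence: float = 0.8):
    """Try tiers cheapest-first; stop at the first sufficiently confident answer.

    If no tier clears the threshold, the last (most capable) tier's answer is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, model in tiers:
        answer = model(query)
        if answer.confidence >= min_confidence:
            break
    return name, answer


def cheap_model(query: str) -> ModelAnswer:
    # Stub: pretends to be unsure about anything longer than 8 words.
    confident = len(query.split()) <= 8
    return ModelAnswer(text=f"cheap: {query}", confidence=0.95 if confident else 0.40)


def premium_model(query: str) -> ModelAnswer:
    # Stub: always reasonably confident.
    return ModelAnswer(text=f"premium: {query}", confidence=0.90)


TIERS = [("cheap", cheap_model), ("premium", premium_model)]

for q in ["Is this email spam?",
          "Compare the long-term maintenance costs of these two database designs in detail"]:
    used, answer = cascade(q, TIERS)
    print(f"{used}: {answer.text[:50]}")
```

Note the trade-off: every escalated query pays for both the cheap call and the premium call, so a cascade only saves money when most queries stop at the first tier.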
Model Routing Decision Tree
flowchart TD
    Q[Incoming Query] --> CL[Classify Complexity]
    CL --> S{"Simple?<br/>Classification<br/>Extraction<br/>Yes/No"}
    CL --> M{"Moderate?<br/>Summarization<br/>Translation<br/>Formatting"}
    CL --> C{"Complex?<br/>Multi-step reasoning<br/>Code generation<br/>Nuanced judgment"}
    S --> M1["gpt-4o-mini<br/>~$0.15/1M tokens"]
    M --> M2["gpt-4o<br/>~$2.50/1M tokens"]
    C --> M3["gpt-4o or claude-opus<br/>~$15/1M tokens"]
    style M1 fill:#dcfce7,stroke:#16a34a,color:#15803d
    style M2 fill:#fef3c7,stroke:#d97706,color:#b45309
    style M3 fill:#fee2e2,stroke:#dc2626,color:#dc2626
Fallback Chain Architecture
Single-model systems fail when the provider has an outage. Fallback chains maintain availability:
flowchart LR
    R[Request] --> P[Primary GPT-4o]
    P -->|success| OUT[Response]
    P -->|timeout / error| FB1[Fallback 1 Claude Sonnet]
    FB1 -->|success| OUT
    FB1 -->|error| FB2[Fallback 2 Gemini Pro]
    FB2 -->|success| OUT
    FB2 -->|error| ERR[Error Response + alert]
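In code, the chain in the diagram might look like the sketch below. Everything here is hypothetical scaffolding: `call_with_fallback`, `AllProvidersFailed`, the `on_exhausted` alert hook, and the stub providers are illustrative names, and real provider calls would go through each vendor's own SDK with an explicit timeout.

```python
from typing import Callable, Optional, Sequence

# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errored."""


def call_with_fallback(
    prompt: str,
    chain: Sequence[Provider],
    on_exhausted: Optional[Callable[[list[str]], None]] = None,
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    failures: list[str] = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # timeouts and provider errors both fall through
            failures.append(f"{name}: {exc!r}")
    if on_exhausted is not None:
        on_exhausted(failures)  # e.g. page the on-call engineer
    raise AllProvidersFailed("; ".join(failures))


def flaky_primary(prompt: str) -> str:
    # Stub: simulates a provider outage.
    raise TimeoutError("primary timed out")


def healthy_fallback(prompt: str) -> str:
    # Stub: simulates a working secondary provider.
    return f"answer to: {prompt}"


CHAIN = [("primary", flaky_primary), ("fallback-1", healthy_fallback)]
served_by, response = call_with_fallback("What is our refund policy?", CHAIN)
print(f"{served_by} -> {response}")
```

Returning the provider name alongside the response keeps the caller aware of which model actually answered, which matters as soon as outputs differ between providers.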
Working Router Implementation
Query Complexity Classifier + Model Router
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


@dataclass
class RoutingDecision:
    complexity: Complexity
    model: str
    reason: str
    estimated_cost_per_1k: float  # USD per 1K tokens


def classify_complexity(query: str) -> Complexity:
    """Simple heuristic-based classifier (replace with ML model in production)."""
    query_lower = query.lower()
    word_count = len(query.split())

    # Simple signals: a short query containing a classification/extraction cue
    simple_patterns = ["is this", "does this", "yes or no", "classify", "extract the", "what is the"]
    if any(p in query_lower for p in simple_patterns) and word_count < 20:
        return Complexity.SIMPLE

    # Complex signals: a reasoning/generation cue, or a long query
    complex_patterns = ["analyze", "compare", "explain why", "design", "write a", "reasoning", "evaluate"]
    if any(p in query_lower for p in complex_patterns) or word_count > 80:
        return Complexity.COMPLEX

    return Complexity.MODERATE


def route_query(query: str) -> RoutingDecision:
    """Route a query to the appropriate model tier."""
    complexity = classify_complexity(query)
    routing_map = {
        Complexity.SIMPLE: RoutingDecision(
            complexity=Complexity.SIMPLE,
            model="gpt-4o-mini",
            reason="Simple classification/extraction task",
            estimated_cost_per_1k=0.00015,
        ),
        Complexity.MODERATE: RoutingDecision(
            complexity=Complexity.MODERATE,
            model="gpt-4o",
            reason="Moderate complexity, standard model sufficient",
            estimated_cost_per_1k=0.0025,
        ),
        Complexity.COMPLEX: RoutingDecision(
            complexity=Complexity.COMPLEX,
            model="gpt-4o",  # or claude-opus for specific use cases
            reason="Complex reasoning requires premium model",
            estimated_cost_per_1k=0.015,
        ),
    }
    return routing_map[complexity]


# Test the router
PREMIUM_COST_PER_1K = 0.015  # baseline: every query sent to the premium tier

test_queries = [
    "Is this email spam? 'Congratulations you won!'",
    "Summarize this 500-word article about renewable energy",
    "Analyze the trade-offs between microservices and monolith architectures for a 10-person startup",
]

for query in test_queries:
    decision = route_query(query)
    savings = (PREMIUM_COST_PER_1K - decision.estimated_cost_per_1k) / PREMIUM_COST_PER_1K * 100
    print(f"Query: {query[:60]}...")
    print(f"  → {decision.model} ({decision.complexity.value})")
    print(f"  → Reason: {decision.reason}")
    print(f"  → Cost savings vs premium: {savings:.0f}%")
    print()

Implement model routing as a strategy pattern - business logic never calls a specific model directly. Instead, it calls a router that returns a model client. This makes it easy to swap models, add new tiers, and A/B test routing strategies without touching feature code.
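One way to picture that separation is the sketch below. All names in it (`ModelClient`, `StubClient`, `ModelRouter`, the two strategy functions, and `summarize_ticket`) are hypothetical, and the stub client only echoes its input; a real client would wrap a provider SDK.

```python
from typing import Callable, Dict, Protocol


class ModelClient(Protocol):
    """Anything with a name and a complete() method can serve as a client."""
    name: str

    def complete(self, prompt: str) -> str: ...


class StubClient:
    """Hypothetical stand-in for a real provider SDK wrapper."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# A strategy maps a prompt to a tier name; swapping it changes routing only.
RoutingStrategy = Callable[[str], str]


def route_by_length(prompt: str) -> str:
    return "premium" if len(prompt.split()) > 80 else "cheap"


def route_always_cheap(prompt: str) -> str:
    return "cheap"


class ModelRouter:
    def __init__(self, clients: Dict[str, ModelClient], strategy: RoutingStrategy) -> None:
        self._clients = clients
        self._strategy = strategy

    def client_for(self, prompt: str) -> ModelClient:
        """Pick the client for the tier the strategy selects."""
        return self._clients[self._strategy(prompt)]


def summarize_ticket(router: ModelRouter, ticket_text: str) -> str:
    """Feature code: knows about the router, never about a specific model."""
    prompt = f"Summarize: {ticket_text}"
    return router.client_for(prompt).complete(prompt)


CLIENTS = {"cheap": StubClient("cheap-model"), "premium": StubClient("premium-model")}
router = ModelRouter(CLIENTS, route_by_length)
print(summarize_ticket(router, "Customer cannot log in after password reset."))
```

Running an A/B test then amounts to constructing a second `ModelRouter` with a different strategy function; `summarize_ticket` does not change.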
Multi-model routing is one of the highest-ROI investments in AI infrastructure. A well-tuned router on a production app handling 100K daily queries can cut model costs from $3,000/month to under $800/month, a reduction of more than 70%. That is worth the engineering time: build it once, and the savings recur every month.
Fallback chains create debugging nightmares. When model B answers because model A failed, always log it explicitly - which model actually responded, why the primary failed, and the latency of the fallback. Without this logging, you’ll spend hours debugging the wrong model’s behavior and wondering why quality suddenly changed.
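A minimal sketch of that logging, using only the standard library, is shown below. The function name `call_with_audit`, the record fields, and the logger name are assumptions for illustration; adapt them to whatever structured-logging setup you already have.

```python
import logging
import time
from typing import Callable, Sequence

logger = logging.getLogger("llm.fallback")  # hypothetical logger name

Provider = tuple[str, Callable[[str], str]]


def call_with_audit(prompt: str, chain: Sequence[Provider]) -> dict:
    """Walk the chain and return a record of who answered, why, and how fast."""
    if not chain:
        raise ValueError("chain must contain at least one provider")
    primary_name = chain[0][0]
    primary_error = None
    for name, call in chain:
        started = time.perf_counter()
        try:
            text = call(prompt)
        except Exception as exc:
            elapsed_ms = (time.perf_counter() - started) * 1000
            if name == primary_name:
                primary_error = repr(exc)  # why the primary failed
            logger.warning("provider=%s failed after %.0f ms: %r", name, elapsed_ms, exc)
            continue
        record = {
            "served_by": name,                 # which model actually responded
            "fell_back": name != primary_name,
            "primary_error": primary_error,
            "latency_ms": (time.perf_counter() - started) * 1000,
            "response": text,
        }
        level = logging.WARNING if record["fell_back"] else logging.INFO
        logger.log(level, "served_by=%s fell_back=%s primary_error=%s latency_ms=%.0f",
                   name, record["fell_back"], primary_error, record["latency_ms"])
        return record
    raise RuntimeError("all providers in the chain failed")


def failing_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")  # stub


def working_fallback(prompt: str) -> str:
    return f"answer to: {prompt}"  # stub


logging.basicConfig(level=logging.INFO)
record = call_with_audit("ping", [("primary", failing_primary), ("fallback-1", working_fallback)])
print(record["served_by"], record["fell_back"])
```

Logging fallback responses at a higher level than normal ones makes a sudden quality change easy to correlate with a provider outage.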
Interview Notes: LLM Gateways and Topology
At scale, teams often put an LLM gateway between apps and model providers. The gateway centralizes routing, fallbacks, rate limiting, budget enforcement, prompt/model version logging, and provider failover.
routes:
  summarize:
    primary: cheap-fast-model
    fallback: general-model
    max_input_tokens: 12000
  legal_review:
    primary: strongest-reasoning-model
    require_human_review: true
limits:
  per_user_per_minute: 20
  per_team_daily_usd: 500
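To make the `limits` section concrete, here is a rough sketch of how a gateway might enforce it. The `GatewayLimits` class and its `allow` method are hypothetical, state is kept in memory for a single process, and the caller supplies the day key and an estimated cost; a real gateway would most likely keep these counters in a shared store so every instance sees the same totals.

```python
import time
from collections import defaultdict, deque


class GatewayLimits:
    """In-memory per-user rate limit and per-team daily budget (single process only)."""

    def __init__(self, per_user_per_minute: int, per_team_daily_usd: float, clock=time.monotonic):
        self.per_user_per_minute = per_user_per_minute
        self.per_team_daily_usd = per_team_daily_usd
        self._clock = clock
        self._recent = defaultdict(deque)   # user -> timestamps of accepted requests
        self._spend = defaultdict(float)    # (team, day) -> USD spent so far

    def allow(self, user: str, team: str, day: str, est_cost_usd: float) -> tuple[bool, str]:
        """Return (allowed, reason). Accepted requests are counted immediately."""
        now = self._clock()
        window = self._recent[user]
        # Drop timestamps that have aged out of the 60-second sliding window.
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.per_user_per_minute:
            return False, "user rate limit exceeded"
        if self._spend[(team, day)] + est_cost_usd > self.per_team_daily_usd:
            return False, "team daily budget exceeded"
        window.append(now)
        self._spend[(team, day)] += est_cost_usd
        return True, "ok"


# Values mirror the `limits` section of the config above.
limits = GatewayLimits(per_user_per_minute=20, per_team_daily_usd=500)
results = [limits.allow("alice", "search-team", "2024-01-01", est_cost_usd=0.01) for _ in range(21)]
print(results[0], results[-1])
```

Checking the budget against an estimated cost before the call, rather than the actual cost afterwards, is a design choice: it can reject a request that would have fit, but it never lets a team overshoot by one large request.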
Interview Practice
- Why route requests across multiple models?
- What signals can drive model routing?
- What is an LLM gateway?
- How do fallbacks affect reliability and cost?
- How should rate limits be enforced across teams?
- What can go wrong when outputs differ across fallback models?