GenAI Foundations / Intermediate Track Module 7 / 8
GenAI Foundations Intermediate ⏱ 25 min

Multi-Model Strategies: Routing, Fallbacks, and Cost Tiers

Not every task needs GPT-4. Route simple queries to cheap models, complex ones to powerful models, and build fallback chains that survive model outages.

How to Use This Lesson

  • Start with the user problem, then map the pattern to architecture and failure modes.
  • If a code or design example is included, change one assumption and reason through the impact.
  • Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: beginner/03-using-ai-apis

The Cost Case for Multi-Model

Using GPT-4 for every request is like hiring a senior architect to answer “what’s 2+2?” - expensive and unnecessary.

An illustrative query mix for a typical application:

  • 70% of queries: simple classification, short extraction, yes/no decisions → cheap model
  • 20% of queries: moderate complexity, multi-step reasoning → mid-tier model
  • 10% of queries: complex reasoning, nuanced judgment, long-form → premium model

Routing intelligently can cut model costs by 60-80% with minimal quality impact, provided complex queries are rarely misrouted to the cheap tier.
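As a rough sanity check on that range, the blended price can be worked out from the query mix above and the approximate per-1M-token prices this lesson uses for each tier. This is a sketch under one loud assumption: every query consumes the same number of tokens. The names `TIERS`, `blended_price`, and `savings_vs_all_premium` are illustrative, not from any library.

```python
# Illustrative arithmetic only. Shares come from the query mix above; prices are
# the approximate per-1M-token figures this lesson uses for each tier.
TIERS = {
    "cheap": {"share": 0.70, "usd_per_1m": 0.15},
    "mid": {"share": 0.20, "usd_per_1m": 2.50},
    "premium": {"share": 0.10, "usd_per_1m": 15.00},
}


def blended_price(tiers: dict) -> float:
    """Average USD per 1M tokens, ASSUMING every query uses the same token count."""
    return sum(t["share"] * t["usd_per_1m"] for t in tiers.values())


def savings_vs_all_premium(tiers: dict) -> float:
    """Fraction saved compared with sending all traffic to the priciest tier."""
    premium = max(t["usd_per_1m"] for t in tiers.values())
    return 1 - blended_price(tiers) / premium


if __name__ == "__main__":
    print(f"Blended price: ${blended_price(TIERS):.3f} per 1M tokens")
    print(f"Savings vs all-premium: {savings_vs_all_premium(TIERS):.0%}")
```

Under the equal-token assumption the blend comes to about $2.11 per 1M tokens, roughly 86% below the all-premium price. Complex queries usually carry more tokens than simple ones, which raises the premium tier's share of spend and lowers the real saving.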

Three Routing Strategies

  1. Complexity-based routing - classify the query before sending it
  2. Task-based routing - route by task type (summarization → cheap, reasoning → premium)
  3. Cascade routing - try cheap model first, escalate if confidence is low
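Cascade routing is the least obvious of the three, so here is a minimal sketch. It assumes each model call returns some confidence score between 0 and 1; how that score is produced (log-probabilities, a self-check prompt, a separate verifier) is left open. `ModelAnswer`, `cascade`, and the stub models are hypothetical names standing in for real API clients.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ModelAnswer:
    text: str
    confidence: float  # 0.0 to 1.0; how it is estimated is up to the application


# Hypothetical model interface: any callable that maps a prompt to a ModelAnswer.
ModelFn = Callable[[str], ModelAnswer]


def cascade(
    prompt: str,
    tiers: List[Tuple[str, ModelFn]],
    threshold: float = 0.8,
) -> Tuple[str, ModelAnswer]:
    """Try tiers cheapest-first and stop at the first confident answer.

    If no tier reaches the threshold, the last (strongest) tier's answer is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, model in tiers:
        answer = model(prompt)
        if answer.confidence >= threshold:
            break
    return name, answer


# Stub models standing in for real API clients.
def cheap_stub(prompt: str) -> ModelAnswer:
    # Pretend the cheap model is only confident on short prompts.
    confidence = 0.95 if len(prompt.split()) < 10 else 0.40
    return ModelAnswer(text=f"cheap answer to: {prompt}", confidence=confidence)


def premium_stub(prompt: str) -> ModelAnswer:
    return ModelAnswer(text=f"premium answer to: {prompt}", confidence=0.90)


TIERS = [("cheap", cheap_stub), ("premium", premium_stub)]

if __name__ == "__main__":
    for q in ["Is this email spam?",
              "Walk through the trade-offs of each option and recommend one for our team"]:
        tier, answer = cascade(q, TIERS)
        print(f"{tier}: {answer.text}")
```

One design consequence to keep in mind: an escalated query pays for both the cheap call and the premium call, so a cascade only saves money when most queries stop at the first tier.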

Model Routing Decision Tree

flowchart TD
  Q[Incoming Query] --> CL[Classify Complexity]
  CL --> S{"Simple?<br/>Classification<br/>Extraction<br/>Yes/No"}
  CL --> M{"Moderate?<br/>Summarization<br/>Translation<br/>Formatting"}
  CL --> C{"Complex?<br/>Multi-step reasoning<br/>Code generation<br/>Nuanced judgment"}
  S --> M1["gpt-4o-mini<br/>~$0.15/1M tokens"]
  M --> M2["gpt-4o<br/>~$2.50/1M tokens"]
  C --> M3["gpt-4o or claude-opus<br/>~$15/1M tokens"]
  style M1 fill:#dcfce7,stroke:#16a34a,color:#15803d
  style M2 fill:#fef3c7,stroke:#d97706,color:#b45309
  style M3 fill:#fee2e2,stroke:#dc2626,color:#dc2626

Fallback Chain Architecture

Single-model systems fail when the provider has an outage. Fallback chains maintain availability:

flowchart LR
  R[Request] --> P["Primary<br/>GPT-4o"]
  P -->|success| OUT[Response]
  P -->|timeout / error| FB1["Fallback 1<br/>Claude Sonnet"]
  FB1 -->|success| OUT
  FB1 -->|error| FB2["Fallback 2<br/>Gemini Pro"]
  FB2 -->|success| OUT
  FB2 -->|error| ERR["Error Response<br/>+ alert"]
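The chain in the diagram can be sketched in a few lines. This version assumes each provider client is a plain callable that enforces its own timeout and raises an exception on timeout or error; the stubs below stand in for real SDK calls, and `call_with_fallback` and `AllProvidersFailed` are hypothetical names.

```python
from typing import Callable, List, Tuple

# (provider name, callable that takes a prompt and returns text or raises)
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def call_with_fallback(prompt: str, chain: List[Provider]) -> dict:
    """Try providers in order; return the first success plus what failed before it."""
    errors = []
    for name, call in chain:
        try:
            text = call(prompt)
        except Exception as exc:  # broad on purpose in a sketch; narrow this in real code
            errors.append((name, repr(exc)))
            continue
        return {"text": text, "served_by": name, "errors": errors}
    # Every provider failed: surface the full error list so the caller can alert.
    raise AllProvidersFailed(errors)


# Stubs standing in for real provider clients.
def primary_stub(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def fallback_stub(prompt: str) -> str:
    return f"answer to: {prompt}"


CHAIN = [("gpt-4o", primary_stub), ("claude-sonnet", fallback_stub)]

if __name__ == "__main__":
    result = call_with_fallback("Summarize this ticket", CHAIN)
    print(result["served_by"], result["errors"])
```

Returning `served_by` alongside the text keeps the caller aware of which model actually answered, which matters because different models can format or phrase the same answer differently.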

Working Router Implementation

Query Complexity Classifier + Model Router

Example code (static). Copy and run locally in your own environment.

from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


@dataclass
class RoutingDecision:
    complexity: Complexity
    model: str
    reason: str
    estimated_cost_per_1k: float  # USD per 1K tokens


def classify_complexity(query: str) -> Complexity:
    """Simple heuristic-based classifier (replace with ML model in production)."""
    query_lower = query.lower()
    word_count = len(query.split())

    # Simple signals: short queries that look like classification or extraction
    simple_patterns = ["is this", "does this", "yes or no", "classify", "extract the", "what is the"]
    if any(p in query_lower for p in simple_patterns) and word_count < 20:
        return Complexity.SIMPLE

    # Complex signals: reasoning keywords or long input
    complex_patterns = ["analyze", "compare", "explain why", "design", "write a", "reasoning", "evaluate"]
    if any(p in query_lower for p in complex_patterns) or word_count > 80:
        return Complexity.COMPLEX

    # Everything else falls through to the middle tier
    return Complexity.MODERATE


def route_query(query: str) -> RoutingDecision:
    """Route a query to the appropriate model tier."""
    complexity = classify_complexity(query)

    routing_map = {
        Complexity.SIMPLE: RoutingDecision(
            complexity=complexity,
            model="gpt-4o-mini",
            reason="Simple classification/extraction task",
            estimated_cost_per_1k=0.00015,  # ~$0.15 per 1M tokens
        ),
        Complexity.MODERATE: RoutingDecision(
            complexity=complexity,
            model="gpt-4o",
            reason="Moderate complexity, standard model sufficient",
            estimated_cost_per_1k=0.0025,  # ~$2.50 per 1M tokens
        ),
        Complexity.COMPLEX: RoutingDecision(
            complexity=complexity,
            model="gpt-4o",  # or claude-opus for specific use cases
            reason="Complex reasoning requires premium model",
            estimated_cost_per_1k=0.015,  # ~$15 per 1M tokens (premium-tier price)
        ),
    }
    return routing_map[complexity]


# Test the router
test_queries = [
    "Is this email spam? 'Congratulations you won!'",
    "Summarize this 500-word article about renewable energy",
    "Analyze the trade-offs between microservices and monolith architectures for a 10-person startup",
]

for query in test_queries:
    decision = route_query(query)
    # Savings relative to sending the same query to the premium tier
    savings = (0.015 - decision.estimated_cost_per_1k) / 0.015 * 100
    print(f"Query: {query[:60]}...")
    print(f"  → {decision.model} ({decision.complexity.value})")
    print(f"  → Reason: {decision.reason}")
    print(f"  → Cost savings vs premium: {savings:.0f}%")
    print()

⚙️ For Developers

Implement model routing as a strategy pattern - business logic never directly calls a specific model. Instead it calls a router that returns a model client. This makes it trivial to swap models, add new tiers, and A/B test routing strategies without touching feature code.
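A minimal sketch of that strategy pattern follows. Everything here is illustrative: `ModelClient`, `RoutingStrategy`, `Router`, and the stub `EchoClient` are hypothetical names, and a real client would wrap a provider SDK instead of echoing the prompt.

```python
from typing import Dict, Protocol


class ModelClient(Protocol):
    def complete(self, prompt: str) -> str: ...


class RoutingStrategy(Protocol):
    def pick(self, prompt: str) -> str:
        """Return the name of the tier that should handle this prompt."""
        ...


class EchoClient:
    """Stub client; a real one would call a provider API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class LengthStrategy:
    """Route short prompts to the cheap tier, everything else to premium."""

    def __init__(self, max_cheap_words: int = 20) -> None:
        self.max_cheap_words = max_cheap_words

    def pick(self, prompt: str) -> str:
        return "cheap" if len(prompt.split()) <= self.max_cheap_words else "premium"


class AlwaysPremium:
    """Baseline strategy, useful as the control arm of an A/B test."""

    def pick(self, prompt: str) -> str:
        return "premium"


class Router:
    def __init__(self, strategy: RoutingStrategy, clients: Dict[str, ModelClient]) -> None:
        self.strategy = strategy
        self.clients = clients

    def client_for(self, prompt: str) -> ModelClient:
        return self.clients[self.strategy.pick(prompt)]


# Feature code depends only on the router, never on a model name.
def summarize_ticket(router: Router, ticket: str) -> str:
    prompt = f"Summarize: {ticket}"
    return router.client_for(prompt).complete(prompt)


CLIENTS = {"cheap": EchoClient("cheap"), "premium": EchoClient("premium")}

if __name__ == "__main__":
    print(summarize_ticket(Router(LengthStrategy(), CLIENTS), "printer is broken"))
    print(summarize_ticket(Router(AlwaysPremium(), CLIENTS), "printer is broken"))
```

Swapping `LengthStrategy` for `AlwaysPremium` changes which model answers without any change to `summarize_ticket`, which is the property the pattern is meant to buy.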

🎯 For Product Managers

Multi-model routing is one of the highest-ROI investments in AI infrastructure. A well-tuned router on a production app handling 100K daily queries can cut model costs from $3,000/month to under $800/month, a reduction of roughly 73%. That is worth the engineering time: the router is built once and the savings recur every month.

Production Gotcha

Fallback chains create debugging nightmares. When model B answers because model A failed, always log it explicitly - which model actually responded, why the primary failed, and the latency of the fallback. Without this logging, you’ll spend hours debugging the wrong model’s behavior and wondering why quality suddenly changed.
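One way to capture those three facts is a single structured log line per response. The sketch below uses only the standard `logging` and `json` modules; the `ResponseRecord` fields and the `llm.fallback` logger name are assumptions chosen for illustration, not a standard schema.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger("llm.fallback")  # hypothetical logger name


@dataclass
class ResponseRecord:
    request_id: str
    primary_model: str            # the model the router intended to use
    served_by: str                # the model that actually produced the response
    primary_error: Optional[str]  # why the primary failed, if it did
    latency_ms: float             # end-to-end latency, including failed attempts

    @property
    def used_fallback(self) -> bool:
        return self.served_by != self.primary_model


def log_response(record: ResponseRecord) -> str:
    """Emit one JSON line per response; fallbacks are logged at WARNING level."""
    payload = asdict(record)
    payload["used_fallback"] = record.used_fallback
    line = json.dumps(payload, sort_keys=True)
    if record.used_fallback:
        logger.warning(line)
    else:
        logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_response(ResponseRecord(
        request_id="req-123",
        primary_model="gpt-4o",
        served_by="claude-sonnet",
        primary_error="TimeoutError('primary timed out')",
        latency_ms=4210.0,
    ))
```

Because `used_fallback` is an explicit field, a dashboard can chart the fallback rate directly and a quality regression can be checked against it before anyone starts debugging the primary model.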

Interview Notes: LLM Gateways and Topology

At scale, teams often put an LLM gateway between apps and model providers. The gateway centralizes routing, fallbacks, rate limiting, budget enforcement, prompt/model version logging, and provider failover.

routes:
  summarize:
    primary: cheap-fast-model
    fallback: general-model
    max_input_tokens: 12000
  legal_review:
    primary: strongest-reasoning-model
    require_human_review: true
limits:
  per_user_per_minute: 20
  per_team_daily_usd: 500
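To make the config concrete, here is a rough sketch of how a gateway might enforce it before forwarding a request. It mirrors the YAML above as a Python dict to stay dependency-free. The `Gateway` class, its `admit` method, the sliding-window limiter, and the caller-supplied cost estimate are all assumptions for illustration; a production gateway would also need shared state across instances and a daily budget reset, neither of which is shown.

```python
import time
from collections import defaultdict, deque

# The YAML config above, expressed as a dict.
CONFIG = {
    "routes": {
        "summarize": {
            "primary": "cheap-fast-model",
            "fallback": "general-model",
            "max_input_tokens": 12000,
        },
        "legal_review": {
            "primary": "strongest-reasoning-model",
            "require_human_review": True,
        },
    },
    "limits": {"per_user_per_minute": 20, "per_team_daily_usd": 500},
}


class GatewayError(Exception):
    """Request rejected by the gateway before reaching any model."""


class Gateway:
    def __init__(self, config: dict, clock=time.monotonic) -> None:
        self.config = config
        self.clock = clock
        self._recent = defaultdict(deque)  # user -> timestamps within the last 60 s
        self._spend = defaultdict(float)   # team -> USD spent today (reset not shown)

    def admit(self, route: str, user: str, team: str,
              input_tokens: int, est_cost_usd: float) -> dict:
        """Check route, token cap, rate limit, and budget; return the routing plan."""
        rule = self.config["routes"].get(route)
        if rule is None:
            raise GatewayError(f"unknown route: {route}")
        max_tokens = rule.get("max_input_tokens")
        if max_tokens is not None and input_tokens > max_tokens:
            raise GatewayError("input exceeds max_input_tokens")

        limits = self.config["limits"]
        now = self.clock()
        window = self._recent[user]
        while window and now - window[0] >= 60:  # drop entries older than one minute
            window.popleft()
        if len(window) >= limits["per_user_per_minute"]:
            raise GatewayError("per-user rate limit exceeded")
        if self._spend[team] + est_cost_usd > limits["per_team_daily_usd"]:
            raise GatewayError("team daily budget exceeded")

        # Only record usage once every check has passed.
        window.append(now)
        self._spend[team] += est_cost_usd
        return {
            "model": rule["primary"],
            "fallback": rule.get("fallback"),
            "require_human_review": rule.get("require_human_review", False),
        }


if __name__ == "__main__":
    gw = Gateway(CONFIG)
    print(gw.admit("summarize", "alice", "team-a", input_tokens=800, est_cost_usd=0.01))
```

Enforcing limits in the gateway, not in each application, means a single noisy user or runaway batch job is stopped in one place, and the budget numbers in the config are the only thing that needs to change when a team's allocation does.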

Interview Practice

  1. Why route requests across multiple models?
  2. What signals can drive model routing?
  3. What is an LLM gateway?
  4. How do fallbacks affect reliability and cost?
  5. How should rate limits be enforced across teams?
  6. What can go wrong when outputs differ across fallback models?