GenAI Foundations / Beginner Track, Module 3 of 9
Beginner ⏱ 20 min
DEV · QA · BA · PM

How to Use APIs to Access AI Models

Make your first AI API call. Understand the difference between OpenAI, Anthropic, and Google APIs, and learn the request/response pattern that powers every AI application.

How to Use This Lesson

  • Start with the user problem, then map the pattern to architecture and failure modes.
  • If a code or design example is included, change one assumption and reason through the impact.
  • Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: 02-understanding-llms

What an API Is (Without the Jargon)

Think of a drive-through window. You pull up, say your order in a specific format (“I’ll have a number 3, no pickles”), and you receive your food in a specific container. You don’t see the kitchen. You don’t know which cook made it. You just get output in response to your input.

An API (Application Programming Interface) works the same way. You send a request in a specific format, and you get a response back. You don’t see the model weights, the GPU cluster, or the inference code. You just send text in and receive text out.

AI APIs are drive-through windows to the world’s most powerful language models.

What “REST API” Actually Means

When developers say “REST API,” they mean a standardized way to make requests over the internet using HTTP - the same protocol your browser uses to load websites.

Every AI API call is fundamentally:

  1. An HTTP POST request sent to a specific URL
  2. A JSON payload in the request body (your prompt + settings)
  3. An API key in the request headers (your authentication token)
  4. A JSON response containing the model’s output

That’s it. Whether you’re calling OpenAI, Anthropic, or Google, the pattern is the same; only the URL, the header that carries the key, and the JSON field names differ.
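The four parts listed above can be sketched with nothing but Python's standard library. This is a minimal illustration that builds the request without sending it; the `build_request` helper and the placeholder key are assumptions made for the example, and the URL is OpenAI's chat completions endpoint.

```python
import json
import urllib.request


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Assemble (but do not send) an HTTP POST for a chat-style AI API."""
    payload = {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url="https://api.openai.com/v1/chat/completions",  # 1. the URL
        data=json.dumps(payload).encode("utf-8"),           # 2. the JSON payload
        headers={
            "Authorization": f"Bearer {api_key}",           # 3. the API key
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("sk-placeholder", "Say hello.")
print(req.get_method(), req.full_url)
# 4. Passing req to urllib.request.urlopen() would return the JSON response.
```

In practice you would use the provider's SDK, which wraps exactly this request for you.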

Never Hardcode API Keys

API keys are passwords. Never put them directly in your code. Use environment variables (os.environ.get("OPENAI_API_KEY")) or a secrets manager. A leaked API key means someone else runs up your bill.
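A minimal sketch of that advice, assuming the key lives in an environment variable: read it once, fail fast with a clear message if it is missing, and never print it in full. The helper names `load_api_key` and `mask` are hypothetical.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment and fail fast if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running.")
    return key


def mask(key: str) -> str:
    """Return a log-safe form of a key with only the last 4 characters visible."""
    if len(key) <= 4:
        return "*" * len(key)
    return "*" * (len(key) - 4) + key[-4:]


# Made-up key, used only to show what a masked log line looks like.
print(mask("sk-example-abcd1234"))
```

Failing at startup is usually easier to debug than a 401 error several calls later.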

The Journey from Your Code to a Response

Here is exactly what happens in the roughly 1-3 seconds between your code sending a request and receiving a response:

From Code to AI Response: The Full Journey

flowchart TD
  A[Your Code] --> B[HTTP POST Request<br/>JSON payload]
  B --> C[API Gateway<br/>Auth + rate limiting]
  C --> D[Load Balancer<br/>Route to GPU cluster]
  D --> E[GPU Inference<br/>Tokens processed]
  E --> F[Response Stream<br/>Tokens returned]
  F --> G[Your Code<br/>Parsed JSON response]

  style A fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
  style G fill:#dcfce7,stroke:#16a34a,color:#15803d
  style E fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
  style C fill:#fef3c7,stroke:#d97706,color:#b45309

Everything between your request and the returned response (steps C through F) is managed by the API provider. You never interact with those steps directly. Your job is the two ends, A and G: form the request, parse the response.

The Universal Request Structure

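Every major AI API uses the same conceptual structure, regardless of provider. The example here uses OpenAI’s field names; Anthropic and Google carry the same information under slightly different names: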

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that summarizes documents concisely."
    },
    {
      "role": "user",
      "content": "Summarize the following in 3 bullet points: [your text here]"
    }
  ],
  "max_tokens": 500,
  "temperature": 0.7
}

What each field does:

  • model - Which specific model to use (different models have different costs and capabilities)
  • messages - An array of conversation turns. The system message sets the AI’s persona and rules. The user message is what you’re asking.
  • max_tokens - Maximum length of the response. Prevents runaway costs and long waits.
  • temperature - Randomness setting, typically from 0 to 2 (the exact range varies by provider). Near 0 = focused and nearly deterministic. Near 1 = creative and varied. Use 0 for structured data extraction and classification, 0.7 for writing.
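To see why temperature changes randomness, here is a sketch of the textbook formulation: the model's raw scores are divided by the temperature before being turned into probabilities. The scores are made up, and treating temperature 0 as "always pick the top token" is an assumption; providers may implement that case differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw model scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding (all mass on the top score).
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token; higher temperatures spread it across the alternatives.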

Provider Comparison

All three major providers expose a similar API surface. Here’s how they differ practically:

| | OpenAI | Anthropic | Google Gemini |
|---|---|---|---|
| API Base URL | https://api.openai.com/v1 | https://api.anthropic.com/v1 | https://generativelanguage.googleapis.com/v1beta |
| Top Model Names | gpt-4o, gpt-4o-mini | claude-opus-4-5, claude-sonnet-4-5 | gemini-1.5-pro, gemini-1.5-flash |
| Python SDK | openai | anthropic | google-generativeai |
| Pricing Tier | Mid-range to high | Mid-range to high | Low to mid-range |
| Context Window | Up to 128K tokens | Up to 200K tokens | Up to 1M tokens |
| Best For | Broadest ecosystem, most tutorials | Long documents, nuanced reasoning | High volume, cost-sensitive workloads |

Which Should I Use?

Start with OpenAI - it has the most documentation, tutorials, and community examples. Switch providers when you have a specific reason: lower cost at volume or a very long context window (Gemini), nuanced reasoning over long documents (Anthropic Claude), or specific compliance requirements.
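Switching providers mostly means reshaping the same request. The sketch below builds one system-plus-user prompt in the three JSON shapes. The field names are an assumption based on each provider's public REST format at the time of writing, so verify them against the current API reference. The function names are hypothetical.

```python
def to_openai(system, user, max_tokens, temperature, model="gpt-4o-mini"):
    """OpenAI Chat Completions: the system prompt is the first message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def to_anthropic(system, user, max_tokens, temperature, model="claude-sonnet-4-5"):
    """Anthropic Messages: the system prompt is a top-level field."""
    return {
        "model": model,
        "system": system,
        "messages": [{"role": "user", "content": user}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def to_gemini(system, user, max_tokens, temperature):
    """Gemini generateContent: text lives in 'parts'; settings in generationConfig."""
    return {
        "systemInstruction": {"parts": [{"text": system}]},
        "contents": [{"role": "user", "parts": [{"text": user}]}],
        "generationConfig": {
            "maxOutputTokens": max_tokens,
            "temperature": temperature,
        },
    }


args = ("You are a concise technical writer.", "Explain what an API is.", 150, 0.5)
for build in (to_openai, to_anthropic, to_gemini):
    print(build.__name__, sorted(build(*args).keys()))
```

Note that in the Gemini REST format the model name goes in the URL path rather than in the body, which is why `to_gemini` takes no `model` argument.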

Your First API Call

Here is a complete, working Python example. It sends a system prompt and user message to OpenAI and prints the response.

First OpenAI API Call

Example code (static). Copy and run locally in your own environment.

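import os
from openai import OpenAI

# Read the key from the environment; never hardcode it.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The system message sets the persona; the user message is the request.
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "Explain what an API is in 2 sentences."},
    ],
    max_tokens=150,   # cap the length of the reply
    temperature=0.5,  # moderate randomness
)

# The generated text is in the first (and usually only) choice.
print(response.choices[0].message.content)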

Breaking down response.choices[0].message.content:

  • response.choices - A list of possible completions. Usually just one.
  • [0] - The first (and typically only) completion.
  • .message.content - The actual text the model generated.

The response object also contains .usage.prompt_tokens and .usage.completion_tokens - use these to track costs.
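Under the SDK object is plain JSON. The sketch below parses a hand-written sample shaped like a chat completion response, to show where the text and token counts sit. The values are invented for illustration and real responses carry additional fields.

```python
import json

# Illustrative sample only; not captured from a real API call.
raw = """
{
  "id": "chatcmpl-example",
  "object": "chat.completion",
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "An API is a contract between programs."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 9, "total_tokens": 33}
}
"""

data = json.loads(raw)
text = data["choices"][0]["message"]["content"]  # same path as the SDK attributes
usage = data["usage"]

print(text)
print("input tokens:", usage["prompt_tokens"])
print("output tokens:", usage["completion_tokens"])
```

The `finish_reason` field is worth checking too: a value of `length` means the reply was cut off by `max_tokens`.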

What Happens When Things Go Wrong

API calls can fail. Common errors and what they mean:

| Error | Cause | Fix |
|---|---|---|
| 401 Unauthorized | Invalid or missing API key | Check your environment variable |
| 429 Too Many Requests | Rate limit exceeded | Implement exponential backoff |
| 400 Bad Request | Malformed JSON or invalid parameters | Check your request structure |
| 500 Internal Server Error | Provider-side issue | Retry with backoff |
| context_length_exceeded | Too many tokens in the request | Truncate input or use a larger context model |

Always wrap API calls in try/except blocks in production code.
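One way to structure that try/except is sketched below. It maps the status codes from the table to a next action. `ApiError` is a hypothetical stand-in for the exception classes a real SDK raises, and `safe_call` is an illustrative helper, not part of any library.

```python
RETRYABLE = {429, 500}  # rate limit and provider-side errors from the table


class ApiError(Exception):
    """Hypothetical error type standing in for an SDK's exception classes."""

    def __init__(self, status_code, message=""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def safe_call(send_request):
    """Run one API call and return (ok, result_or_next_action)."""
    try:
        return True, send_request()
    except ApiError as err:
        if err.status_code == 401:
            return False, "check API key"
        if err.status_code in RETRYABLE:
            return False, "retry with backoff"
        return False, "fix the request"  # e.g. 400 Bad Request


def rate_limited():
    raise ApiError(429)


print(safe_call(lambda: "model output"))
print(safe_call(rate_limited))
```

The key design choice is separating errors worth retrying (429, 500) from errors that will fail again until the code or configuration changes (400, 401).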

🧪 For QA Engineers

What this means for testing: The API endpoint is your seam - the boundary between your code and the AI provider. For unit tests, mock the API client entirely (don’t make real calls). For integration tests, call the real API with controlled prompts and assert on response structure, not exact content. Never let your unit tests run up an API bill.
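A minimal sketch of the unit-test half of that advice, using `unittest.mock` from the standard library. The `summarize` function is a hypothetical piece of application code; the mock replaces the client, so no network call is made and no tokens are billed.

```python
from unittest.mock import MagicMock


def summarize(client, text):
    """Code under test: one chat completion call that returns the reply text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
        max_tokens=150,
    )
    return response.choices[0].message.content


def test_summarize_uses_model_reply():
    client = MagicMock()
    # Shape the fake response like the real one: choices[0].message.content
    client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Short summary."))
    ]

    assert summarize(client, "long text") == "Short summary."

    # Assert on how the client was called, not on real model output.
    kwargs = client.chat.completions.create.call_args.kwargs
    assert kwargs["model"] == "gpt-4o-mini"
    assert "long text" in kwargs["messages"][0]["content"]


test_summarize_uses_model_reply()
print("unit test passed without calling the API")
```

Passing the client in as an argument is what makes this easy: the test can hand in a fake without patching imports.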

📊 For Business Analysts

What this means for cost estimation: AI APIs are priced per token (a token is roughly three-quarters of an English word), usually with separate rates for input and output. A typical document summary might cost $0.001-$0.01 per call. At 10,000 calls/day, that’s $10-$100/day. Get usage estimates from your dev team and factor them into your business case. The pricing pages of each provider show exact rates per model.
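The arithmetic behind that estimate can be sketched in a few lines. The token counts and per-million-token prices below are made-up placeholders, not any provider's actual rates; substitute the numbers from the pricing page of the model you plan to use.

```python
def cost_per_call(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one call, given prices per 1 million tokens."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


# Hypothetical workload: a 2,000-token document summarized into 300 tokens.
# Hypothetical prices: $0.50 per 1M input tokens, $1.50 per 1M output tokens.
per_call = cost_per_call(2000, 300, 0.50, 1.50)
calls_per_day = 10_000

daily = per_call * calls_per_day
monthly = daily * 30

print(f"per call: ${per_call:.5f}")
print(f"per day:  ${daily:.2f}")
print(f"per month (30 days): ${monthly:.2f}")
```

With these placeholder numbers a call costs about $0.00145, which works out to roughly $14.50 per day at 10,000 calls.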

🎯 For Product Managers

What this means for your roadmap: Rate limits exist at every provider - typically 500-10,000 requests per minute depending on your tier. If your feature is expected to serve many concurrent users, design your UX to handle queuing and latency gracefully. A “thinking…” spinner is better than a broken interface. Also plan for the 99th percentile latency (2-10 seconds), not average latency.
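To make the average-versus-p99 point concrete, the sketch below computes both from a made-up set of 100 latency samples. The `percentile` helper uses the simple nearest-rank method; monitoring tools may use a slightly different interpolation.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up latencies in seconds: most calls are fast, a few are very slow.
latencies = [1.2] * 90 + [3.0] * 8 + [9.0] * 2

mean = statistics.mean(latencies)
p99 = percentile(latencies, 99)

print(f"average: {mean:.2f}s")
print(f"p99:     {p99:.2f}s")
```

In this sample the average looks comfortable while the slowest 1-2% of users wait several times longer, which is the experience the UX has to handle.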

What’s Next

Now that you can make an API call, the quality of your results depends almost entirely on what you put in the messages array. That’s prompt engineering - and it’s the subject of the next tutorial.

Practice Exercise

Set your OPENAI_API_KEY environment variable and run the code example above. Then change the system prompt to “You are a pirate” and observe how the response tone changes. This single experiment teaches you more about prompt engineering than most blog posts.

Interview Notes: API Controls and Reliability

Production API usage is more than sending a prompt. You should know the common controls:

| Control | Use |
|---|---|
| temperature | Lower for deterministic extraction, higher for creative variation. |
| top_p | Nucleus sampling; limits choices to the smallest set of tokens whose cumulative probability reaches the threshold. |
| top_k | Samples only from the top K most likely tokens, when supported. |
| Beam search | Explores several likely sequences; useful in some translation/search settings, less common for chat UX. |
| Streaming | Sends partial output to improve perceived latency. |
| Retries | Use exponential backoff for rate limits and transient provider errors. |
| Batching | Use for offline classification, embeddings, and eval workloads where latency is less important. |

import asyncio

from openai import RateLimitError  # raised by the OpenAI SDK on HTTP 429


# client is assumed to be an openai.AsyncOpenAI instance.
async def call_with_retry(client, payload, attempts=3):
    """Call the API, retrying rate-limit errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await client.responses.create(**payload)
        except RateLimitError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await asyncio.sleep(2 ** attempt)  # wait 1s, then 2s, then 4s, ...

payload = {
    "model": "fast-model",  # placeholder model name
    "input": "Summarize this support ticket in JSON.",
    "temperature": 0.1,
    "top_p": 0.9,
}
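The two sampling controls `top_k` and `top_p` are easier to compare side by side. The sketch below applies each filter to a made-up next-token distribution. It illustrates the general technique only; it is not any provider's actual implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


# Made-up next-token probabilities.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}

print("top_k=2  ->", top_k_filter(probs, 2))
print("top_p=0.9 ->", top_p_filter(probs, 0.9))
```

The difference is that `top_k` always keeps a fixed number of candidates, while `top_p` keeps more candidates when the model is uncertain and fewer when it is confident.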

Interview Practice

  1. What fields should a basic AI API request include?
  2. When would you use streaming instead of waiting for the full response?
  3. How should a client handle rate limits and transient provider failures?
  4. Compare temperature, top_p, top_k, and beam search.
  5. When is batch processing better than synchronous API calls?
  6. What metadata should you log for each model request?