How to Use APIs to Access AI Models | Praveen Srinag Yellamaraju

What an API Is (Without the Jargon)

Think of a drive-through window. You pull up, say your order in a specific format (“I’ll have a number 3, no pickles”), and you receive your food in a specific container. You don’t see the kitchen. You don’t know which cook made it. You just get output in response to your input.

An API (Application Programming Interface) works the same way. You send a request in a specific format, and you get a response back. You don’t see the model weights, the GPU cluster, or the inference code. You just send text in and receive text out.

AI APIs are drive-through windows to the world’s most powerful language models.

What “REST API” Actually Means

When developers say “REST API,” they mean a standardized way to make requests over the internet using HTTP - the same protocol your browser uses to load websites.

Every AI API call is fundamentally:

An HTTP POST request sent to a specific URL
A JSON payload in the request body (your prompt + settings)
An API key in the request headers (your authentication token)
A JSON response containing the model’s output

That’s it. Whether you’re calling OpenAI, Anthropic, or Google, the pattern is the same; only details such as the URL, the header names, and some field names differ.
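As a minimal sketch of those four parts, the snippet below assembles (but does not send) a request using only the Python standard library. The endpoint path `/chat/completions`, the `Bearer` header format, and the placeholder key follow OpenAI's public conventions and are assumptions here; the helper name `build_chat_request` is made up for illustration.

```python
import json
import urllib.request


def build_chat_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Assemble an HTTP POST request for a chat-style AI API (nothing is sent)."""
    payload = {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url="https://api.openai.com/v1/chat/completions",  # the specific URL
        data=json.dumps(payload).encode("utf-8"),           # the JSON payload
        headers={
            "Authorization": f"Bearer {api_key}",           # the API key
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("sk-placeholder", "Say hello.")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing this object to `urllib.request.urlopen` would perform the call and return the JSON response; the official SDKs wrap these same steps.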

Never Hardcode API Keys

API keys are passwords. Never put them directly in your code. Use environment variables (os.environ.get("OPENAI_API_KEY")) or a secrets manager. A leaked API key means someone else runs up your bill.
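One way to apply this, sketched under the assumption that the key lives in an environment variable: read it once at startup and fail with a clear message if it is missing. The helper name `load_api_key` and its `env` parameter are hypothetical; `env` exists only so the function can be exercised without touching the real environment.

```python
import os


def load_api_key(name: str = "OPENAI_API_KEY", env=os.environ) -> str:
    """Return the key stored in the environment, failing fast if it is absent."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key


# Demonstration with an in-memory mapping instead of the real environment.
print(load_api_key(env={"OPENAI_API_KEY": "sk-placeholder"}))
```

Failing at startup is a design choice: a missing key surfaces immediately instead of as a `401` on the first request.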

The Journey from Your Code to a Response

Here is exactly what happens in the roughly 1-3 seconds between your code sending a request and receiving a response:

From Code to AI Response: The Full Journey

flowchart TD
  A[Your Code] --> B[HTTP POST Request<br/>JSON payload]
  B --> C[API Gateway<br/>Auth + rate limiting]
  C --> D[Load Balancer<br/>Route to GPU cluster]
  D --> E[GPU Inference<br/>Tokens processed]
  E --> F[Response Stream<br/>Tokens returned]
  F --> G[Your Code<br/>Parsed JSON response]

  style A fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
  style G fill:#dcfce7,stroke:#16a34a,color:#15803d
  style E fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
  style C fill:#fef3c7,stroke:#d97706,color:#b45309


Steps C through E run entirely on the provider’s side. You never interact with them directly. Your job is the two ends of the journey: form the request (A and B) and parse the response (F and G).
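The parsing half of that job can be sketched with the standard library alone. The response body below is an abridged, illustrative example in the chat-completions shape (its text and token counts are made up), and `extract_text` is a hypothetical helper name.

```python
import json

# Abridged, illustrative response body in the chat-completions shape.
raw_body = """
{
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "An API is a contract between programs."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 21, "completion_tokens": 9, "total_tokens": 30}
}
"""


def extract_text(body: str) -> str:
    """Step G: parse the JSON response and pull out the generated text."""
    data = json.loads(body)
    return data["choices"][0]["message"]["content"]


print(extract_text(raw_body))
```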

The Universal Request Structure

Every major AI API uses the same conceptual structure. The example below uses OpenAI’s Chat Completions field names; other providers arrange the same pieces slightly differently:

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that summarizes documents concisely."
    },
    {
      "role": "user",
      "content": "Summarize the following in 3 bullet points: [your text here]"
    }
  ],
  "max_tokens": 500,
  "temperature": 0.7
}

What each field does:

model - Which specific model to use (different models have different costs and capabilities)
messages - An array of conversation turns. The system message sets the AI’s persona and rules. The user message is what you’re asking.
max_tokens - Maximum length of the response. Prevents runaway costs and long waits.
temperature - Randomness setting, from 0 to 2 on OpenAI (other providers may accept a narrower range). Near 0 = focused and close to deterministic. Near 1 = creative and varied. Use 0 for structured data extraction and classification, and around 0.7 for writing.
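To see why temperature changes randomness, here is a small sketch of the standard softmax-with-temperature formula. It is a textbook formulation, not any provider's actual implementation, and the token scores are made up.

```python
import math


def token_probabilities(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


logits = {"cat": 2.0, "dog": 1.0, "fish": 0.1}  # made-up scores
for t in (0, 0.2, 1.0, 2.0):
    probs = token_probabilities(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

Low temperatures concentrate nearly all probability on the top token, while high temperatures spread it across the alternatives, which is where the extra variety comes from.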

Provider Comparison

All three major providers expose a similar API surface. Here’s how they differ practically:

	OpenAI	Anthropic	Google Gemini
API Base URL	`https://api.openai.com/v1`	`https://api.anthropic.com/v1`	`https://generativelanguage.googleapis.com/v1beta`
Top Model Names	`gpt-4o`, `gpt-4o-mini`	`claude-opus-4-5`, `claude-sonnet-4-5`	`gemini-1.5-pro`, `gemini-1.5-flash`
Python SDK	`openai`	`anthropic`	`google-generativeai`
Pricing Tier	Mid-range to high	Mid-range to high	Low to mid-range
Context Window	Up to 128K tokens	Up to 200K tokens	Up to 1M tokens
Best For	Broadest ecosystem, most tutorials	Long documents, nuanced reasoning	High volume, cost-sensitive workloads
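One practical difference the table does not show is how each provider expects the API key to be sent. The sketch below reflects the header names in each provider's public REST documentation as best understood at the time of writing; treat them, and the `anthropic-version` value, as assumptions to verify against the current reference. The function name `auth_headers` is hypothetical.

```python
def auth_headers(provider: str, api_key: str) -> dict[str, str]:
    """Return the HTTP headers a raw REST call would need for each provider."""
    headers = {"Content-Type": "application/json"}
    if provider == "openai":
        headers["Authorization"] = f"Bearer {api_key}"
    elif provider == "anthropic":
        headers["x-api-key"] = api_key
        headers["anthropic-version"] = "2023-06-01"  # assumed version string
    elif provider == "google":
        headers["x-goog-api-key"] = api_key
    else:
        raise ValueError(f"Unknown provider: {provider}")
    return headers


for name in ("openai", "anthropic", "google"):
    print(name, sorted(auth_headers(name, "placeholder-key")))
```

The official Python SDKs set these headers for you, so this only matters when calling the REST endpoints directly.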

Which Should I Use?

Start with OpenAI - it has the most documentation, tutorials, and community examples. Switch providers when you have a specific reason: lower cost at volume or the largest context window (Gemini), nuanced reasoning over long documents (Anthropic Claude), or specific compliance requirements.

Your First API Call

Here is a complete, working Python example. It sends a system prompt and user message to OpenAI and prints the response.

First OpenAI API Call


import os
from openai import OpenAI

# Read the key from the environment rather than hardcoding it.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Send one system message (persona and rules) and one user message (the ask).
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "Explain what an API is in 2 sentences."},
    ],
    max_tokens=150,
    temperature=0.5,
)

# Print the text of the first completion.
print(response.choices[0].message.content)

Breaking down response.choices[0].message.content:

response.choices - A list of possible completions. Usually just one.
[0] - The first (and typically only) completion.
.message.content - The actual text the model generated.

The response object also contains .usage.prompt_tokens and .usage.completion_tokens - use these to track costs.
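A minimal sketch of that tracking, assuming each response exposes `usage.prompt_tokens` and `usage.completion_tokens` as described. The `UsageTracker` class is hypothetical, and the per-million-token prices are placeholders, not real rates.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTracker:
    """Running totals of tokens consumed across calls."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    calls: int = 0

    def record(self, usage) -> None:
        """Add one response's usage (any object with the two token attributes)."""
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens
        self.calls += 1

    def cost(self, input_price_per_million: float, output_price_per_million: float) -> float:
        """Estimated spend in dollars for the tokens recorded so far."""
        return (
            self.prompt_tokens * input_price_per_million
            + self.completion_tokens * output_price_per_million
        ) / 1_000_000


tracker = UsageTracker()
# Stand-in for response.usage from a real call.
tracker.record(SimpleNamespace(prompt_tokens=1000, completion_tokens=500))
print(tracker.calls, tracker.prompt_tokens, tracker.completion_tokens)
print(f"${tracker.cost(0.5, 1.5):.5f}")  # placeholder prices per million tokens
```

Input and output tokens are kept separate because providers generally price them differently.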

What Happens When Things Go Wrong

API calls can fail. Common errors and what they mean:

Error	Cause	Fix
`401 Unauthorized`	Invalid or missing API key	Check your environment variable
`429 Too Many Requests`	Rate limit exceeded	Implement exponential backoff
`400 Bad Request`	Malformed JSON or invalid parameters	Check your request structure
`500 Internal Server Error`	Provider-side issue	Retry with backoff
`context_length_exceeded`	Too many tokens in the request	Truncate input or use a larger context model

Always wrap API calls in try/except blocks in production code.
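The table above can be turned into a small decision helper. In this sketch, `ApiError` is a hypothetical stand-in for whatever exception your SDK raises (real SDKs define their own classes), and treating every 5xx status as retryable is an assumption that goes slightly beyond the table.

```python
class ApiError(Exception):
    """Hypothetical stand-in for an SDK exception that carries an HTTP status."""

    def __init__(self, status_code: int, message: str = ""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def next_action(status_code: int) -> str:
    """Map an HTTP status to the fix suggested in the error table."""
    if status_code == 401:
        return "check_api_key"
    if status_code == 429 or 500 <= status_code < 600:
        return "retry_with_backoff"
    if 400 <= status_code < 500:
        return "fix_request"
    return "raise"


def safe_call(make_request):
    """Run a zero-argument callable; return (result, None) or (None, action)."""
    try:
        return make_request(), None
    except ApiError as err:
        return None, next_action(err.status_code)


def rate_limited():
    raise ApiError(429)


print(safe_call(rate_limited))
print(safe_call(lambda: "ok"))
```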

🧪 For QA Engineers

What this means for testing: The API endpoint is your seam - the boundary between your code and the AI provider. For unit tests, mock the API client entirely (don’t make real calls). For integration tests, call the real API with controlled prompts and assert on response structure, not exact content. Never let your unit tests run up an API bill.
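A minimal sketch of the unit-test half, using `unittest.mock` from the standard library. The `summarize` function is a hypothetical piece of code under test; the fake client mimics only the attributes that function touches, so no network call or API key is involved.

```python
from unittest.mock import MagicMock


def summarize(client, text: str) -> str:
    """Code under test: asks the model for a summary and returns its text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content


def test_summarize_returns_model_text():
    fake_client = MagicMock()
    fake_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Short summary."))
    ]
    assert summarize(fake_client, "a long document") == "Short summary."
    # Assert on how the client was called, without any network traffic.
    kwargs = fake_client.chat.completions.create.call_args.kwargs
    assert kwargs["messages"][0]["content"].startswith("Summarize:")


test_summarize_returns_model_text()
print("unit test passed without calling the API")
```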

📊 For Business Analysts

What this means for cost estimation: AI APIs are priced per token (a token is roughly three-quarters of an English word), and input and output tokens are usually priced separately. A typical document summary might cost $0.001-$0.01 per call. At 10,000 calls/day, that’s $10-$100/day. Get usage estimates from your dev team and factor them into your business case. The pricing pages of each provider show exact rates per model.
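The arithmetic behind that estimate, as a short sketch you can rerun with your own numbers. The per-call costs are the illustrative range from the paragraph above, and the 30-day month is an assumption.

```python
def daily_cost(calls_per_day: int, cost_per_call: float) -> float:
    """Projected spend per day in dollars."""
    return calls_per_day * cost_per_call


low = daily_cost(10_000, 0.001)
high = daily_cost(10_000, 0.01)
print(f"Daily: ${low:,.2f} to ${high:,.2f}")
print(f"30-day month: ${low * 30:,.2f} to ${high * 30:,.2f}")
```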

🎯 For Product Managers

What this means for your roadmap: Rate limits exist at every provider - typically 500-10,000 requests per minute depending on your tier. If your feature is expected to serve many concurrent users, design your UX to handle queuing and latency gracefully. A “thinking…” spinner is better than a broken interface. Also plan for the 99th percentile latency (2-10 seconds), not average latency.
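To illustrate why the 99th percentile matters more than the average, the sketch below computes both from a list of latencies. The sample values are made up, and `percentile` uses the simple nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in seconds: mostly fast, with a slow tail.
latencies = [1.2] * 95 + [3.0, 4.5, 6.0, 8.0, 9.5]
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.2f}s  p50={percentile(latencies, 50)}s  p99={percentile(latencies, 99)}s")
```

With a slow tail like this one, the mean stays close to the typical request while the 99th percentile is several times larger, and that is the wait your unluckiest users actually experience.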

What’s Next

Now that you can make an API call, the quality of your results depends almost entirely on what you put in the messages array. That’s prompt engineering - and it’s the subject of the next tutorial.

Practice Exercise

Set your OPENAI_API_KEY environment variable and run the code example above. Then change the system prompt to “You are a pirate” and observe how the response tone changes. This single experiment teaches you more about prompt engineering than most blog posts.

Interview Notes: API Controls and Reliability

Production API usage is more than sending a prompt. You should know the common controls:

Control	Use
`temperature`	Lower for deterministic extraction, higher for creative variation.
`top_p`	Nucleus sampling; samples only from the smallest set of most likely tokens whose cumulative probability reaches the threshold.
`top_k`	Samples only from the top K likely tokens when supported.
Beam search	Explores several likely sequences; useful in some translation/search settings, less common for chat UX.
Streaming	Sends partial output to improve perceived latency.
Retries	Use exponential backoff for rate limits and transient provider errors.
Batching	Use for offline classification, embeddings, and eval workloads where latency is less important.
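The difference between `top_k` and `top_p` is easiest to see on a toy distribution. This is a minimal sketch of the two filtering rules as they are usually described, not any provider's implementation; the token probabilities are made up.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept = []
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}  # made-up distribution
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
```

`top_k` always keeps a fixed number of candidates, while `top_p` keeps more candidates when the model is uncertain and fewer when one token dominates.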

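import asyncio

from openai import APIConnectionError, InternalServerError, RateLimitError

async def call_with_retry(client, payload, attempts=3):
    """Call the API, retrying rate limits and transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            # client is assumed to be an openai.AsyncOpenAI instance.
            return await client.responses.create(**payload)
        except (RateLimitError, APIConnectionError, InternalServerError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await asyncio.sleep(2 ** attempt)  # wait 1s, then 2s, ...

payload = {
    "model": "fast-model",  # placeholder: substitute a real model name
    "input": "Summarize this support ticket in JSON.",
    "temperature": 0.1,
    "top_p": 0.9,
}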

Interview Practice

What fields should a basic AI API request include?
When would you use streaming instead of waiting for the full response?
How should a client handle rate limits and transient provider failures?
Compare temperature, top_p, top_k, and beam search.
When is batch processing better than synchronous API calls?
What metadata should you log for each model request?
