Skip to content

Cost and Rate Limits

Ignoring cost and rate limits is the fastest way to ship a product that hemorrhages money or breaks under load. This section gives you the numbers and the patterns to handle both.

Learning objectives

  • Calculate token costs for real workloads
  • Implement exponential backoff retry logic
  • Use async concurrency within rate limits
  • Apply prompt caching and model routing to reduce spend

2025 pricing reference

Prices as of May 2025 — always verify at platform.openai.com/pricing and anthropic.com/pricing.

OpenAI

Model Input (per 1M tokens) Output (per 1M tokens) Notes
gpt-4o $2.50 $10.00 Cached input: $1.25
gpt-4o-mini $0.15 $0.60 Best value for simple tasks
o4-mini $1.10 $4.40 + reasoning tokens at $1.10/1M
o3 $10.00 $40.00 Highest capability

Anthropic

Model Input (per 1M tokens) Output (per 1M tokens) Cache write Cache read
claude-opus-4-7 $15.00 $75.00 $18.75 $1.50
claude-sonnet-4-6 $3.00 $15.00 $3.75 $0.30
claude-haiku-4-5 $0.80 $4.00 $1.00 $0.08

Cost calculator

import os
from dataclasses import dataclass
from openai import OpenAI

@dataclass
class TokenCost:
    input_tokens: int
    output_tokens: int
    model: str

    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "o4-mini": {"input": 1.10, "output": 4.40},
        "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
        "claude-haiku-4-5-20251001": {"input": 0.80, "output": 4.00},
    }

    @property
    def cost_usd(self) -> float:
        p = self.PRICING.get(self.model, {"input": 0, "output": 0})
        return (self.input_tokens * p["input"] + self.output_tokens * p["output"]) / 1_000_000

    def __str__(self) -> str:
        return (
            f"Model: {self.model}\n"
            f"  Input:  {self.input_tokens:,} tokens\n"
            f"  Output: {self.output_tokens:,} tokens\n"
            f"  Cost:   ${self.cost_usd:.6f}"
        )

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers in 100 words."}],
    max_tokens=150
)

cost = TokenCost(
    input_tokens=response.usage.prompt_tokens,
    output_tokens=response.usage.completion_tokens,
    model="gpt-4o"
)
print(cost)

Monthly cost projection

def project_monthly_cost(
    calls_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "gpt-4o"
) -> dict:
    PRICING = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "claude-sonnet-4-6": (3.00, 15.00),
        "claude-haiku-4-5-20251001": (0.80, 4.00),
    }
    input_price, output_price = PRICING[model]

    monthly_calls = calls_per_day * 30
    monthly_input = monthly_calls * avg_input_tokens
    monthly_output = monthly_calls * avg_output_tokens
    monthly_cost = (monthly_input * input_price + monthly_output * output_price) / 1_000_000

    return {
        "model": model,
        "monthly_calls": monthly_calls,
        "monthly_input_tokens": monthly_input,
        "monthly_output_tokens": monthly_output,
        "monthly_cost_usd": monthly_cost
    }

# Example: 10K daily calls, 800 input tokens, 200 output tokens
for model in ["gpt-4o", "gpt-4o-mini", "claude-sonnet-4-6", "claude-haiku-4-5-20251001"]:
    result = project_monthly_cost(10_000, 800, 200, model)
    print(f"{model:35s}: ${result['monthly_cost_usd']:,.2f}/month")

# Output (approximate):
# gpt-4o                             : $660.00/month
# gpt-4o-mini                        : $39.00/month
# claude-sonnet-4-6                  : $810.00/month
# claude-haiku-4-5-20251001          : $216.00/month

Output tokens cost 4–5× more than input

Most engineers focus on system prompt size (input tokens) but forget that a verbose model response is far more expensive. Set max_tokens to the minimum needed, and instruct the model to be concise when output length isn't critical.


Rate limits

OpenAI and Anthropic both limit requests per minute (RPM) and tokens per minute (TPM). Default limits for new accounts:

Provider Tier 1 RPM Tier 1 TPM
OpenAI (gpt-4o) 500 30,000
OpenAI (gpt-4o-mini) 500 200,000
Anthropic (Sonnet) 50 40,000

Limits increase automatically as you spend more. Check your actual limits: - OpenAI: platform.openai.com/account/limits - Anthropic: console.anthropic.com/settings/limits


Retry with exponential backoff

import time
import random
from openai import RateLimitError, APITimeoutError, APIConnectionError, APIStatusError

def with_retry(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Decorator-style retry wrapper with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return fn()

        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            retry_after = float(e.response.headers.get("retry-after", base_delay * (2 ** attempt)))
            jitter = random.uniform(0, 0.5)
            wait = retry_after + jitter
            print(f"Rate limit — waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)

        except APITimeoutError:
            if attempt == max_retries - 1:
                raise
            wait = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Timeout — waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)

        except APIStatusError as e:
            if e.status_code in {500, 502, 503, 529}:  # server errors
                if attempt == max_retries - 1:
                    raise
                wait = base_delay * (2 ** attempt)
                print(f"Server error {e.status_code} — waiting {wait:.1f}s")
                time.sleep(wait)
            else:
                raise  # 400, 401, 404 — don't retry

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

result = with_retry(lambda: client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=10
))
print(result.choices[0].message.content)

Async batch processing within rate limits

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

async def process_one(text: str, sem: asyncio.Semaphore, model: str = "gpt-4o-mini") -> dict:
    async with sem:
        try:
            response = await async_client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Respond with one word."},
                    {"role": "user", "content": text}
                ],
                max_tokens=5,
                temperature=0.0
            )
            return {
                "text": text,
                "sentiment": response.choices[0].message.content.strip().lower(),
                "tokens": response.usage.total_tokens
            }
        except Exception as e:
            return {"text": text, "error": str(e)}

async def batch_sentiment(texts: list[str], concurrency: int = 10) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)
    tasks = [process_one(t, sem) for t in texts]
    results = await asyncio.gather(*tasks, return_exceptions=False)
    total_tokens = sum(r.get("tokens", 0) for r in results)
    print(f"Processed {len(results)} texts, {total_tokens:,} total tokens")
    return results

# Process 100 reviews concurrently (10 at a time)
reviews = [f"Review text number {i}" for i in range(100)]
results = asyncio.run(batch_sentiment(reviews))

Cost reduction strategies

1. Model routing — use cheap models for simple tasks

def route_request(user_message: str) -> str:
    # Classify complexity with a cheap model first
    classification = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Is this a simple factual question or a complex reasoning task? Answer 'simple' or 'complex'.\n\nQuestion: {user_message}"
        }],
        max_tokens=5,
        temperature=0.0
    ).choices[0].message.content.strip().lower()

    model = "gpt-4o-mini" if classification == "simple" else "gpt-4o"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        max_tokens=500
    )
    print(f"Routed to: {model}")
    return response.choices[0].message.content

2. Response caching for repeated queries

import hashlib
from functools import lru_cache

def cache_key(model: str, messages: list[dict]) -> str:
    content = f"{model}:{messages}"
    return hashlib.sha256(content.encode()).hexdigest()

_response_cache: dict[str, str] = {}

def cached_completion(model: str, messages: list[dict], max_tokens: int = 500) -> str:
    key = cache_key(model, messages)
    if key in _response_cache:
        print("Cache hit — no API call")
        return _response_cache[key]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=0.0  # deterministic = safe to cache
    )
    result = response.choices[0].message.content
    _response_cache[key] = result
    return result

3. Batch API for non-urgent workloads

OpenAI's Batch API processes requests asynchronously at 50% discount with 24h turnaround — ideal for evaluation pipelines, bulk annotation, and offline analysis.

import json

# Prepare batch file
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarize: document {i}"}],
            "max_tokens": 100
        }
    }
    for i in range(50)
]

# Write JSONL batch file
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit
with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
print(f"Batch ID: {batch.id}, Status: {batch.status}")
# Poll client.batches.retrieve(batch.id) until status == "completed"

Common mistakes

Forgetting max_tokens allows runaway costs

Without max_tokens, the model can generate up to its context window limit. On a 128K model, an unconstrained generation could produce 100K+ tokens at $10/1M output tokens. Always set max_tokens.

Retrying 400 errors wastes money

400 errors (invalid request) will never succeed on retry. Only retry 429 (rate limit), 500/502/503 (server errors), and timeouts. Retrying 400s burns API quota and delays failure detection.

Token counting != word counting

"My prompt is 500 words" does not mean 500 tokens. English text is roughly 0.75 tokens per word, but code, JSON, and non-English text can be 2–4 tokens per word. Always count tokens directly.


04-anthropic-messages-api | 06-practice-exercises