Why Run LLMs Locally¶

The default for most teams is API-first: call OpenAI or Anthropic and don't think about infrastructure. But there are real cases where local inference is the right answer — and knowing the difference saves both money and compliance headaches.

Learning objectives¶

Identify the use cases where local inference is justified
Understand the tradeoffs: cost, latency, quality, privacy, and engineering complexity
Recognize the hardware landscape for consumer and server-grade local inference

The case for local inference¶

1. Data privacy and compliance¶

Some data must not leave your organization:

Medical records (HIPAA) — patient data cannot be sent to third-party APIs without a BAA
Legal documents — attorney-client privileged material
Financial records — PII in contracts, tax documents, internal projections
Defense and government — classified information, ITAR-controlled data

If your use case involves any of these, the answer is local inference (or a private cloud deployment with a BAA, which costs significantly more than local).

2. Cost at scale¶

API pricing is attractive for low volume. At high volume, the economics shift:

# Cost comparison: API vs local inference for a document processing pipeline
def api_vs_local_cost(
    daily_documents: int,
    avg_tokens_per_doc: int = 2000,
    api_price_per_1m_tokens: float = 0.15,  # gpt-4o-mini input
) -> dict:

    monthly_tokens = daily_documents * avg_tokens_per_doc * 30

    api_monthly = (monthly_tokens / 1_000_000) * api_price_per_1m_tokens

    # Local inference: one-time GPU cost
    # A100 80GB can process ~500K tokens/hour (7B model, Q4)
    # Cloud A100: ~$3/hour; amortize over 24/7 operation
    gpu_hourly = 3.0
    hours_needed_per_month = (monthly_tokens / 500_000)  # 500K tokens/hour
    local_monthly = gpu_hourly * hours_needed_per_month

    return {
        "monthly_tokens": f"{monthly_tokens:,}",
        "api_monthly_usd": f"${api_monthly:.2f}",
        "local_monthly_usd": f"${local_monthly:.2f}",
        "crossover_daily_docs": int(3.0 * 500_000 / (avg_tokens_per_doc * api_price_per_1m_tokens / 1e6))
    }

result = api_vs_local_cost(daily_documents=10_000)
print(result)
# For 10K docs/day, 2K tokens each:
# monthly_tokens: 600,000,000
# api_monthly: $90.00
# local_monthly: ~$108 (cloud GPU)
# But self-owned GPU (one-time $10K) amortizes over 3+ years

3. Latency control¶

API calls involve network round-trips: 100–500ms minimum even for small requests. Local inference eliminates the network — with a GPU, time-to-first-token can be < 50ms.

This matters for: - Real-time applications (voice assistants, interactive tools) - High-frequency batch processing where latency compounds - Edge deployments (IoT, mobile, offline)

4. No rate limits¶

API rate limits cap throughput. If you need 1000 concurrent inferences, you need either an enterprise tier or your own infrastructure.

What you give up¶

┌──────────────────────────────────────────────────────┐
│  API (OpenAI, Anthropic)          │  Local inference  │
├──────────────────────────────────────────────────────┤
│  GPT-4o / Claude Sonnet quality   │  7B–13B quality   │
│  Zero setup                       │  Days of setup    │
│  Pay-as-you-go                    │  Hardware cost    │
│  Automatic updates                │  Manual updates   │
│  No hardware to manage            │  GPU maintenance  │
│  Vision, function calling, etc.   │  Varies by model  │
│  Elastic capacity                 │  Fixed capacity   │
└──────────────────────────────────────────────────────┘

The quality gap is real

The best open-source models (Llama 3.1 70B, Qwen2.5 72B) are excellent — close to GPT-4o on many benchmarks. But they require significant GPU resources (35–40 GB VRAM for Q4). The 7B–13B models that run on consumer hardware are noticeably weaker on complex reasoning tasks. Don't choose local inference if your use case requires frontier-model quality.

Hardware landscape (2025)¶

Hardware	VRAM	Max model (Q4)	Tokens/sec	Cost
Apple M3 Pro 18GB	18 GB unified	13B	25–40	$2,000
Apple M3 Max 48GB	48 GB unified	34B	30–50	$4,000
RTX 4090	24 GB	13B (or 7B float16)	80–120	$1,600
RTX 3090	24 GB	13B (Q4)	50–80	$700–900 used
RTX 4060 Ti	16 GB	13B (Q4)	40–60	$400
2× A100 80GB	160 GB	70B (float16)	60–80	$20K+ or cloud

Apple Silicon for local LLMs

Apple M-series chips use unified memory shared between CPU and GPU. An M3 Max with 96 GB can run Llama 3.1 70B in Q4 quantization at 20–30 tokens/sec. This makes MacBook Pro and Mac Studio competitive with mid-range GPU workstations for local inference.

00-agenda | 02-ollama