Ollama

Ollama is one of the quickest ways to run LLMs locally. Install it, pull a model, and within minutes you have a local OpenAI-compatible API endpoint. It handles GGUF model downloads, GPU detection, context management, and REST API serving automatically.

Learning objectives

  • Install Ollama and pull models from the library
  • Use the Ollama REST API and Python ollama client
  • Drop in Ollama as an OpenAI API replacement using base_url
  • Configure model parameters and run models efficiently

Installation and first run

# macOS
brew install ollama
# or: download the macOS app from https://ollama.com (install.sh has historically targeted Linux)

# Windows: download the installer from https://ollama.com

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama server (skip this if the desktop app or a system service already runs it)
ollama serve   # runs on http://localhost:11434

# In a new terminal: pull and run a model
ollama pull llama3.2:3b          # 2.0 GB, fast to download, runs on 8GB RAM
ollama pull llama3.1:8b          # 4.7 GB, a strong general-purpose 8B model
ollama pull mistral:7b           # 4.1 GB, fast, good for coding
ollama pull nomic-embed-text     # 274 MB, embeddings

# One-shot prompt (omit the quoted prompt to start an interactive chat)
ollama run llama3.1:8b "Explain attention mechanisms in 3 sentences."
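
Once the server is running, the REST API can also be called directly, without any client library. The sketch below uses only the standard library and assumes the default address http://localhost:11434 and the native /api/chat endpoint. The helper names build_chat_request and chat_once are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address of `ollama serve`


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_once(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With a model pulled and the server up, chat_once("llama3.2:3b", "Say hello in five words.") should return the reply as a string. Nothing is sent until chat_once is called, so the request builder can be inspected offline.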

Python client

import ollama

# Simple generation
response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is the difference between RAG and fine-tuning?"}
    ]
)
print(response["message"]["content"])
print(f"\nTokens: {response['eval_count']} tokens, {response['eval_duration']/1e9:.1f}s")

# Streaming
for chunk in ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a haiku about vector databases."}],
    stream=True
):
    print(chunk["message"]["content"], end="", flush=True)
print()
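
The eval_count and eval_duration fields printed above can be turned into a throughput figure, which is handy when comparing models or quantizations on the same machine. This is a minimal sketch: it assumes the response supports dict-style access as in the example above and that durations are reported in nanoseconds. The generation_stats helper and the sample numbers are illustrative, not real measurements.

```python
from collections.abc import Mapping


def generation_stats(response: Mapping) -> dict:
    """Summarize token counts and speed from a non-streaming chat response."""
    eval_seconds = response["eval_duration"] / 1e9  # nanoseconds to seconds
    output_tokens = response["eval_count"]
    return {
        # prompt_eval_count may be missing, so fall back to 0
        "prompt_tokens": response.get("prompt_eval_count", 0),
        "output_tokens": output_tokens,
        "eval_seconds": eval_seconds,
        "tokens_per_second": output_tokens / eval_seconds if eval_seconds else 0.0,
    }


# Illustrative numbers only, standing in for a real response
sample = {"prompt_eval_count": 26, "eval_count": 120, "eval_duration": 4_000_000_000}
stats = generation_stats(sample)
print(f"{stats['output_tokens']} tokens at {stats['tokens_per_second']:.1f} tok/s")
# prints: 120 tokens at 30.0 tok/s
```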

OpenAI-compatible API

Ollama exposes /v1/chat/completions, the same endpoint path and request format as OpenAI's Chat Completions API. You can use the OpenAI Python client by pointing base_url at your local Ollama server.

from openai import OpenAI

# Point to local Ollama — works with any OpenAI SDK code
local_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Any non-empty string — Ollama ignores it
)

response = local_client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You answer questions about machine learning."},
        {"role": "user", "content": "Explain cosine similarity in 2 sentences."}
    ],
    temperature=0.0,
    max_tokens=200
)
print(response.choices[0].message.content)

This means most code written for the OpenAI SDK works with local Ollama after changing two things: the client constructor (base_url and api_key) and the model name. Useful for:

  • Testing locally before paying for API calls
  • Switching between local (dev) and API (prod) via environment variable
  • Cost-free iteration on prompts and pipelines

import os
from openai import OpenAI

# Switch between local and API via environment variable
USE_LOCAL = os.getenv("USE_LOCAL_LLM", "false").lower() == "true"

if USE_LOCAL:
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    model = "llama3.1:8b"
else:
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    model = "gpt-4o-mini"

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello, what can you do?"}],
    max_tokens=100
)
print(f"[{model}] {response.choices[0].message.content}")

Embeddings with Ollama

import ollama
import numpy as np

def embed_local(texts: list[str], model: str = "nomic-embed-text") -> np.ndarray:
    """Embed texts using a local Ollama model."""
    embeddings = []
    for text in texts:
        response = ollama.embeddings(model=model, prompt=text)
        embeddings.append(response["embedding"])
    return np.array(embeddings)

# Usage
texts = [
    "How does vector search work?",
    "What is approximate nearest neighbor?",
    "The stock market rose 2% today.",
]

embs = embed_local(texts)
embs_normalized = embs / np.linalg.norm(embs, axis=1, keepdims=True)
sim = embs_normalized @ embs_normalized.T

print(f"Dimension: {embs.shape[1]}")
print(f"Q0 ↔ Q1 similarity: {sim[0,1]:.3f}")   # Should be high (both about vector search)
print(f"Q0 ↔ Q2 similarity: {sim[0,2]:.3f}")   # Should be low (off-topic)
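
The similarity matrix above is the core of a retrieval step: embed a query, score it against stored document vectors, and keep the best matches. A rough sketch of that ranking follows, written in plain Python so it runs without numpy or a model. The toy 3-dimensional vectors stand in for real embeddings, and the cosine and top_k helpers are illustrative names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


# Toy vectors standing in for embeddings returned by embed_local()
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
for idx, score in top_k(query, docs):
    print(f"doc {idx}: {score:.3f}")
```

For more than a few thousand documents, the matrix product used earlier (or a vector database) will be much faster than this per-pair loop.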

Modelfile: customizing model behavior

Ollama's Modelfile lets you create custom model variants with preset parameters and system prompts.

# Modelfile — save as ./Modelfile
FROM llama3.1:8b

# Set sampling parameters and the context window
PARAMETER temperature 0.0
PARAMETER num_ctx 8192
PARAMETER top_k 40
PARAMETER top_p 0.9

# Custom system prompt baked in
SYSTEM """You are a RAG evaluation assistant. Your job is to analyze whether
LLM answers are faithful to the provided context.

Rules:
1. Score faithfulness from 0.0 to 1.0.
2. List any claims not supported by context.
3. Never add information from outside the context.
4. Return structured JSON."""

# Build and test the custom model
ollama create rag-evaluator -f ./Modelfile
ollama run rag-evaluator "Evaluate: Context: 'Python was released in 1991.' Answer: 'Python was released in 1991 by a team at MIT.'"

# Use the custom model in Python
import ollama

response = ollama.chat(
    model="rag-evaluator",
    messages=[{
        "role": "user",
        "content": "Context: 'The refund policy allows returns within 14 days.'\nAnswer: 'Returns are accepted within 14 days of purchase.'"
    }]
)
print(response["message"]["content"])
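
Because the system prompt asks for structured JSON, the reply usually needs parsing and validation before it can be used in a pipeline. Small models sometimes wrap the JSON in extra prose. The sketch below handles that case under one loud assumption: the Modelfile does not fix a schema, so the keys "faithfulness" and "unsupported_claims" are hypothetical and should match whatever your prompt actually requests.

```python
import json
import re


def parse_evaluation(raw: str) -> dict:
    """Extract a JSON object from a model reply and validate the score.

    Assumes hypothetical keys "faithfulness" (float, 0.0 to 1.0) and
    "unsupported_claims" (list of strings).
    """
    # Take everything from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    result = json.loads(match.group(0))
    score = float(result["faithfulness"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    result["faithfulness"] = score
    result.setdefault("unsupported_claims", [])
    return result


# Hand-written reply for illustration, not real model output
reply = 'Result: {"faithfulness": 0.5, "unsupported_claims": ["by a team at MIT"]} Done.'
print(parse_evaluation(reply))
```

In practice this would be called as parse_evaluation(response["message"]["content"]). Passing format="json" to ollama.chat may also help, since it asks the server to constrain the reply to valid JSON.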

Model management

# List downloaded models and their sizes
ollama list

# Remove a model to free disk space
ollama rm llama3.2:3b

# Show model metadata
ollama show llama3.1:8b

# Pull a specific quantization
ollama pull llama3.1:8b-instruct-q4_K_M   # 4-bit k-quant, medium variant (recommended)
ollama pull llama3.1:8b-instruct-q8_0     # 8-bit, higher quality, roughly 2× larger
ollama pull llama3.1:8b-instruct-fp16     # 16-bit weights (unquantized), requires 16+ GB VRAM

Recommended quantizations

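For most use cases, q4_K_M offers the best balance of quality and size. It uses llama.cpp's k-quant scheme (the K does not stand for k-means), which groups weights into super-blocks with per-block scales and keeps some tensors at higher precision, giving better accuracy than the older q4_0 at a similar size. Step up to q8_0 only if every percentage point of accuracy matters for your task and you have the VRAM.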
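
The size trade-off between quantizations can be estimated with simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below assumes approximate bits-per-weight figures. The q4_K_M value in particular is an assumed average that varies by model, and the result covers weights only, not the KV cache or runtime overhead.

```python
# Approximate bits per weight for common GGUF quantizations.
# q4_0 and q8_0 follow from their block layouts; q4_K_M is an assumed
# average, since it mixes precisions across tensors.
BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q4_K_M": 4.85,  # assumption, varies by model
    "q8_0": 8.5,
    "fp16": 16.0,
}


def estimate_weights_gb(n_params: float, quant: str) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes) for a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


# Rough estimate for a model with about 8 billion parameters
for quant in ("q4_K_M", "q8_0", "fp16"):
    print(f"{quant:>7}: ~{estimate_weights_gb(8.0e9, quant):.1f} GB")
```

Leave headroom beyond these figures: the context window (num_ctx) adds memory on top of the weights, and the overhead grows with longer contexts.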

