Ollama¶
Ollama is the fastest path to running LLMs locally. Install it, pull a model, and you have a local OpenAI-compatible API endpoint in under five minutes. It handles GGUF model downloading, GPU detection, context management, and REST API serving — all automatically.
Learning objectives¶
- Install Ollama and pull models from the library
- Use the Ollama REST API and Python
ollamaclient - Drop in Ollama as an OpenAI API replacement using
base_url - Configure model parameters and run models efficiently
Installation and first run¶
# macOS
brew install ollama
# or: curl -fsSL https://ollama.com/install.sh | sh
# Windows (download installer from https://ollama.com)
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Start the Ollama server
ollama serve # runs on http://localhost:11434
# In a new terminal: pull and run a model
ollama pull llama3.2:3b # 2.0 GB — fast to download, runs on 8GB RAM
ollama pull llama3.1:8b # 4.7 GB — best 8B model available
ollama pull mistral:7b # 4.1 GB — fast, good for coding
ollama pull nomic-embed-text # 274 MB — embeddings
# Interactive chat
ollama run llama3.1:8b "Explain attention mechanisms in 3 sentences."
Python client¶
import ollama
# Simple generation
response = ollama.chat(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You are a concise technical assistant."},
{"role": "user", "content": "What is the difference between RAG and fine-tuning?"}
]
)
print(response["message"]["content"])
print(f"\nTokens: {response['eval_count']} tokens, {response['eval_duration']/1e9:.1f}s")
# Streaming
for chunk in ollama.chat(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Write a haiku about vector databases."}],
stream=True
):
print(chunk["message"]["content"], end="", flush=True)
print()
OpenAI-compatible API¶
Ollama exposes /v1/chat/completions — the same endpoint as OpenAI. You can use the OpenAI Python client with base_url pointing to your local Ollama.
from openai import OpenAI
# Point to local Ollama — works with any OpenAI SDK code
local_client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Any non-empty string — Ollama ignores it
)
response = local_client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You answer questions about machine learning."},
{"role": "user", "content": "Explain cosine similarity in 2 sentences."}
],
temperature=0.0,
max_tokens=200
)
print(response.choices[0].message.content)
This means any code written for OpenAI works with local Ollama by changing two lines. Useful for: - Testing locally before paying for API calls - Switching between local (dev) and API (prod) via environment variable - Cost-free iteration on prompts and pipelines
import os
from openai import OpenAI
# Switch between local and API via environment variable
USE_LOCAL = os.getenv("USE_LOCAL_LLM", "false").lower() == "true"
if USE_LOCAL:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
model = "llama3.1:8b"
else:
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
model = "gpt-4o-mini"
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": "Hello, what can you do?"}],
max_tokens=100
)
print(f"[{model}] {response.choices[0].message.content}")
Embeddings with Ollama¶
import ollama
import numpy as np
def embed_local(texts: list[str], model: str = "nomic-embed-text") -> np.ndarray:
"""Embed texts using a local Ollama model."""
embeddings = []
for text in texts:
response = ollama.embeddings(model=model, prompt=text)
embeddings.append(response["embedding"])
return np.array(embeddings)
# Usage
texts = [
"How does vector search work?",
"What is approximate nearest neighbor?",
"The stock market rose 2% today.",
]
embs = embed_local(texts)
embs_normalized = embs / np.linalg.norm(embs, axis=1, keepdims=True)
sim = embs_normalized @ embs_normalized.T
print(f"Dimension: {embs.shape[1]}")
print(f"Q0 ↔ Q1 similarity: {sim[0,1]:.3f}") # Should be high (both about vector search)
print(f"Q0 ↔ Q2 similarity: {sim[0,2]:.3f}") # Should be low (off-topic)
Modelfile: customizing model behavior¶
Ollama's Modelfile lets you create custom model variants with preset parameters and system prompts.
# Modelfile — save as ./Modelfile
FROM llama3.1:8b
# Set the temperature and context window
PARAMETER temperature 0.0
PARAMETER num_ctx 8192
PARAMETER top_k 40
PARAMETER top_p 0.9
# Custom system prompt baked in
SYSTEM """You are a RAG evaluation assistant. Your job is to analyze whether
LLM answers are faithful to the provided context.
Rules:
1. Score faithfulness from 0.0 to 1.0.
2. List any claims not supported by context.
3. Never add information from outside the context.
4. Return structured JSON."""
# Build and test the custom model
ollama create rag-evaluator -f ./Modelfile
ollama run rag-evaluator "Evaluate: Context: 'Python was released in 1991.' Answer: 'Python was released in 1991 by a team at MIT.'"
# Use the custom model in Python
response = ollama.chat(
model="rag-evaluator",
messages=[{
"role": "user",
"content": "Context: 'The refund policy allows returns within 14 days.'\nAnswer: 'Returns are accepted within 14 days of purchase.'"
}]
)
print(response["message"]["content"])
Model management¶
# List downloaded models and their sizes
ollama list
# Remove a model to free disk space
ollama rm llama3.2:3b
# Show model metadata
ollama show llama3.1:8b
# Pull a specific quantization version
ollama pull llama3.1:8b-instruct-q4_K_M # Q4 K-mean Medium — recommended
ollama pull llama3.1:8b-instruct-q8_0 # Q8 — higher quality, 2× larger
ollama pull llama3.1:8b-instruct-fp16 # Full precision — requires 16+ GB VRAM
Recommended quantizations
For most use cases: q4_K_M — best balance of quality and size. It uses K-means quantization for better accuracy than simple q4_0. Only step up to q8_0 if you're doing tasks where every percentage point of accuracy matters and have the VRAM.