llama.cpp

llama.cpp is a C/C++ implementation of LLM inference, optimized for CPUs, with GPU acceleration through backends such as CUDA and Metal. It introduced the GGUF format, now the most widely supported format for local models, and is among the fastest CPU inference engines available. Ollama uses llama.cpp under the hood.

Learning objectives

Understand GGUF format and how quantization levels are named
Install llama-cpp-python and run inference without Ollama
Integrate llama.cpp with LangChain for drop-in local LLM use
Configure GPU offloading for mixed CPU/GPU inference

GGUF format

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It stores model weights, tokenizer data, and metadata in a single self-contained file.

model.gguf
├── metadata          ← model architecture, context length, vocab size
├── tokenizer         ← vocabulary, special tokens, BPE merges
└── tensors           ← quantized weights (Q4_K_M, Q8_0, F16, etc.)
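
To make the single-file layout concrete, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes the little-endian version 2+ layout (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key-value count); `read_gguf_header` is a hypothetical helper name, not part of llama.cpp or llama-cpp-python.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size header of a GGUF file (assumes little-endian, v2+)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # uint32 version, uint64 tensor count, uint64 metadata key-value count
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Calling `read_gguf_header("./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf")` on a downloaded model is a quick sanity check that the file is intact GGUF before you spend time loading it.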

Naming convention:

Llama-3.1-8B-Instruct.Q4_K_M.gguf
│         │           │
│         │           └── Quantization: 4-bit k-quant, Medium variant
│         └── Model size (8 billion parameters)
└── Model family and variant
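
The naming scheme is a community convention rather than something the format enforces, so any parser is best-effort. The sketch below splits a filename into its parts under that assumption; `parse_gguf_filename` and the two regular expressions are hypothetical helpers written for this illustration.

```python
import re

# Assumed pattern: <model name><"." or "-"><quant tag>.gguf
_GGUF_NAME = re.compile(r"^(?P<name>.+?)[.-](?P<quant>I?Q\d\w*|BF16|F16|F32)\.gguf$")
# A parameter count such as "8B" or "1.5B", delimited by "-", "_" or "."
_SIZE = re.compile(r"(?:^|[-_.])(\d+(?:\.\d+)?)B(?=$|[-_.])")

def parse_gguf_filename(filename: str) -> dict:
    """Best-effort split of a conventionally named GGUF file."""
    m = _GGUF_NAME.match(filename)
    if m is None:
        raise ValueError(f"unrecognized GGUF filename: {filename}")
    name, quant = m.group("name"), m.group("quant")
    size = _SIZE.search(name)
    bits = re.match(r"I?Q(\d+)", quant)
    return {
        "model": name,
        "params_billions": float(size.group(1)) if size else None,
        "quant": quant,
        "bits": int(bits.group(1)) if bits else None,  # None for F16/F32/BF16
        "k_quant": "_K" in quant,
    }

print(parse_gguf_filename("Llama-3.1-8B-Instruct.Q4_K_M.gguf"))
```

Files that do not follow the convention raise `ValueError`, which is usually preferable to silently guessing a quantization level.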

Where to find GGUF files: Search huggingface.co for a model name + "GGUF". Most popular models have GGUF versions uploaded by the community (e.g., bartowski/Meta-Llama-3.1-8B-Instruct-GGUF).

Installation

# CPU only (works everywhere)
pip install llama-cpp-python

# With CUDA GPU support (requires CUDA toolkit)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# With Metal GPU support (macOS Apple Silicon)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
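
After installing, it can help to confirm which build you ended up with, since a failed GPU build may leave a CPU-only wheel in place. The following is a sketch: `check_llama_cpp` is a hypothetical helper, and it assumes recent llama-cpp-python versions expose `llama_supports_gpu_offload`, which is why that function is looked up defensively.

```python
import importlib.util

def check_llama_cpp() -> dict:
    """Report whether llama-cpp-python is importable and what the build supports."""
    if importlib.util.find_spec("llama_cpp") is None:
        return {"installed": False, "version": None, "gpu_offload": None}
    import llama_cpp
    # Assumption: recent versions bind llama_supports_gpu_offload(); older ones may not.
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    return {
        "installed": True,
        "version": getattr(llama_cpp, "__version__", None),
        "gpu_offload": bool(probe()) if callable(probe) else None,
    }

print(check_llama_cpp())
```

If `gpu_offload` is `False` after a CUDA or Metal install, the package was most likely built without the GPU backend and `n_gpu_layers` will have no effect.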

Basic inference

from llama_cpp import Llama

# Load a GGUF model
llm = Llama(
    model_path="./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,          # Context window size
    n_threads=8,         # CPU threads (set to physical core count)
    n_gpu_layers=0,      # 0 = CPU only; -1 offloads every layer to the GPU
    verbose=False        # Suppress loading logs
)

# Simple completion
output = llm(
    prompt="Q: What is retrieval-augmented generation?\nA:",
    max_tokens=200,
    stop=["Q:", "\n\n"],
    echo=False           # Don't include prompt in output
)
print(output["choices"][0]["text"])

Chat completions with chat templates

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35,     # Offload 35 layers to GPU (adjust based on your VRAM)
    chat_format="llama-3",  # Applies the correct chat template
    verbose=False
)

# OpenAI-compatible chat completions
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python function to compute the nth Fibonacci number."}
    ],
    max_tokens=300,
    temperature=0.0,
    stop=["<|eot_id|>"]
)
print(response["choices"][0]["message"]["content"])
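
To see what a chat template does, it can help to render the prompt by hand. The sketch below builds the Llama 3 prompt layout from a message list; `render_llama3_prompt` is a hypothetical helper for illustration and is not how llama-cpp-python implements `chat_format="llama-3"` internally.

```python
from typing import Dict, List

def render_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages into the Llama 3 prompt layout (illustration only)."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content'].strip()}<|eot_id|>"
        )
    # Open an assistant turn so the model generates the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(render_llama3_prompt([
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Write a Python function to compute the nth Fibonacci number."},
]))
```

Each turn ends with `<|eot_id|>`, which is why that token is a natural stop sequence for chat completions with this model family.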

GPU layer offloading

A key feature of llama.cpp is partial GPU offloading. If you have a GPU with less VRAM than the full model requires, you can offload only some layers to GPU and run the rest on CPU.

import os

def get_model_layer_count(model_path: str) -> int:
    """Estimate the number of transformer layers from file size (approximation)."""
    size_gb = os.path.getsize(model_path) / 1e9
    # Rough heuristic for Q4 7B-class models: ~0.13 GB per layer, i.e. ~8 layers per GB
    return int(size_gb * 8)

def optimal_gpu_layers(model_path: str, vram_gb: float) -> int:
    """Estimate n_gpu_layers for the available VRAM."""
    model_size_gb = os.path.getsize(model_path) / 1e9
    # Reserve 2 GB for KV cache and other overhead
    available_for_model = max(0.0, vram_gb - 2.0)
    fraction = available_for_model / model_size_gb
    total_layers = get_model_layer_count(model_path)
    return int(total_layers * min(fraction, 1.0))

# Example: 7B Q4 model (4.1 GB, ~32 layers) on a GPU with 6 GB VRAM
# optimal_gpu_layers ≈ (6-2)/4.1 * 32 ≈ 31 layers on GPU, rest on CPU
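
The 2 GB reserve above is a rule of thumb. The KV cache portion can be estimated directly from the model architecture and context size. The sketch below assumes a 16-bit cache and uses the published Llama-3.1-8B shape (32 layers, 8 key-value heads, head dimension 128); `kv_cache_bytes` is a hypothetical helper, and the real reserve also has to cover compute buffers that this estimate ignores.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: one key and one value tensor per layer."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Llama-3.1-8B with a 4096-token context and a 16-bit cache
size = kv_cache_bytes(n_layers=32, n_ctx=4096, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB")  # prints "0.50 GiB"
```

Because the cache grows linearly with `n_ctx`, doubling the context window doubles this figure, which is worth checking before raising `n_ctx` on a GPU that is already nearly full.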

from llama_cpp import Llama
import time

def benchmark_generation(llm: Llama, prompt: str, n_tokens: int = 100) -> dict:
    start = time.perf_counter()
    output = llm(prompt, max_tokens=n_tokens, echo=False)
    elapsed = time.perf_counter() - start
    # Count tokens actually generated; the model may stop before max_tokens
    generated = output["usage"]["completion_tokens"]

    return {
        "tokens_per_second": generated / elapsed,
        "elapsed_sec": elapsed,
        "output": output["choices"][0]["text"][:100]
    }

# Compare: all CPU vs partial GPU offload
llm_cpu = Llama(model_path="./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf",
                n_ctx=2048, n_gpu_layers=0, verbose=False)
llm_gpu = Llama(model_path="./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf",
                n_ctx=2048, n_gpu_layers=20, verbose=False)

prompt = "Explain gradient descent in simple terms:"

cpu_bench = benchmark_generation(llm_cpu, prompt)
gpu_bench = benchmark_generation(llm_gpu, prompt)

print(f"CPU only: {cpu_bench['tokens_per_second']:.1f} tok/s")
print(f"GPU (20 layers): {gpu_bench['tokens_per_second']:.1f} tok/s")
print(f"Speedup: {gpu_bench['tokens_per_second']/cpu_bench['tokens_per_second']:.1f}×")
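
The benchmark above folds prompt processing and token generation into one number. A streaming run lets you separate the two. The sketch below is one way to do that, assuming each streamed completion chunk carries its text under `choices[0]["text"]` and corresponds to roughly one token; `time_stream` is a hypothetical helper.

```python
import time
from typing import Iterable, Optional

def time_stream(chunks: Iterable[dict]) -> dict:
    """Measure time to the first text chunk and the chunk rate after it."""
    start = time.perf_counter()
    first: Optional[float] = None
    n_chunks = 0
    for chunk in chunks:
        if not chunk["choices"][0].get("text"):
            continue  # skip chunks that carry no text
        n_chunks += 1
        if first is None:
            first = time.perf_counter()
    end = time.perf_counter()
    decode_time = end - first if first is not None else 0.0
    return {
        "time_to_first_chunk_sec": None if first is None else first - start,
        "chunks": n_chunks,
        # Rate over the chunks that arrived after the first one
        "chunks_per_second": (n_chunks - 1) / decode_time
        if n_chunks > 1 and decode_time > 0 else None,
    }
```

A typical call would be `time_stream(llm(prompt, max_tokens=100, stream=True))`. Time to first chunk is dominated by prompt processing, while the chunk rate reflects generation speed, and the two often respond differently to GPU offloading.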

Streaming with llama.cpp

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=35,
    verbose=False
)

# Stream tokens as they generate
print("Streaming response:")
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "List 5 best practices for RAG pipeline design."}],
    max_tokens=400,
    temperature=0.7,
    stream=True
):
    delta = chunk["choices"][0].get("delta", {})
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()
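
Applications often need the complete reply as well as the live output, for example to store it in a conversation history. The sketch below accumulates streamed chat chunks into one string and records why generation stopped. `collect_stream` is a hypothetical helper, and it assumes the OpenAI-style chunk shape used in the loop above.

```python
from typing import Iterable, Optional, Tuple

def collect_stream(chunks: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Join streamed chat deltas into the full reply and capture the finish reason."""
    pieces = []
    finish_reason = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        # The first chunk usually carries only the role; the last one an empty delta.
        piece = choice.get("delta", {}).get("content")
        if piece:
            pieces.append(piece)
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(pieces), finish_reason
```

A finish reason of `"length"` rather than `"stop"` indicates the reply was cut off by `max_tokens`, which is a useful signal to surface to the caller.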

Using llama.cpp with LangChain

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

# LlamaCpp is a LangChain LLM wrapper
llm = LlamaCpp(
    model_path="./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=35,
    n_ctx=4096,
    n_threads=8,
    temperature=0.0,
    max_tokens=300,
    verbose=False,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
)

template = """Use the context to answer the question. If the answer is not in context, say so.

Context: {context}

Question: {question}

Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
chain = prompt | llm

context = "The company was founded in 2015 and has 500 employees across 3 offices."
question = "How many employees does the company have?"

# Tokens stream to stdout through the callback handler; result holds the full text.
result = chain.invoke({"context": context, "question": question})
# Typical answer: "The company has 500 employees."

llama.cpp vs Ollama

Ollama is llama.cpp with a management layer: a model library, automatic GPU detection, a REST API, and Modelfile configuration. For most use cases, Ollama is the easier choice. Use llama.cpp directly when you need programmatic control over model loading, such as partial layer offloading to specific GPUs, custom memory layouts, or embedding and generation from the same model instance.
