Skip to content

Quantization

Quantization compresses model weights from high-precision floats to lower-precision integers. The goal is the same as image compression: reduce file size with minimal perceptible quality loss. Understanding quantization helps you choose the right format for your hardware and quality requirements.

Learning objectives

  • Understand the difference between GGUF quantization levels
  • Know when to use Q4, Q6, and Q8 formats
  • Apply GPTQ and AWQ quantization for GPU inference
  • Measure quality degradation across quantization levels

What quantization does

A model weight is originally stored as a float32 (4 bytes, 32 bits). Quantization maps these values to integers with fewer bits.

float32 (32 bits):  0.15234375  → exact value
float16 (16 bits):  0.152       → 2× smaller, tiny loss
int8    (8 bits):   24          → 4× smaller, small loss
int4    (4 bits):   1           → 8× smaller, noticeable loss
int2    (2 bits):   0           → 16× smaller, significant loss

The quantization error accumulates across hundreds of layers. At 4-bit, the error is usually small enough that most users can't tell the difference in conversation quality. At 2-bit, quality degrades substantially.


GGUF quantization types

llama.cpp and Ollama use GGUF quantization with several variants. The naming follows a pattern:

Format Bits Size (7B model) Quality Use case
F32 32 28 GB Reference Never — too large
F16 16 14 GB ≈100% GPU training/inference
Q8_0 8 7.2 GB ~99.5% High quality, have VRAM
Q6_K 6 5.5 GB ~99% Good quality, less VRAM
Q5_K_M 5 4.8 GB ~98.5% Balanced
Q4_K_M 4 4.1 GB ~97% Best default choice
Q4_0 4 3.9 GB ~96% Slightly worse than Q4_K_M
Q3_K_M 3 3.1 GB ~94% Size-constrained only
Q2_K 2 2.2 GB ~88% Avoid unless necessary

K-quantization (K_M, K_S, K_L): Uses k-means clustering to find optimal quantization centroids. Better quality than simple quantization at the same bit width. K_M (medium) is the standard choice; K_S is smaller (fewer bits for non-critical layers); K_L is larger.

def recommend_quantization(vram_gb: float, quality_priority: str = "balanced") -> str:
    """Recommend GGUF quantization given available VRAM and quality need."""

    if quality_priority == "max_quality":
        if vram_gb >= 14:
            return "F16 — maximum quality, full float16"
        if vram_gb >= 7:
            return "Q8_0 — near-lossless, ~99.5% of F16 quality"
        if vram_gb >= 5.5:
            return "Q6_K — very high quality, ~99%"
    elif quality_priority == "balanced":
        if vram_gb >= 8:
            return "Q5_K_M — excellent balance of size and quality"
        if vram_gb >= 4.5:
            return "Q4_K_M — best default; ~97% quality, 4.1GB for 7B"
        if vram_gb >= 3.5:
            return "Q4_0 — slightly less accurate than Q4_K_M"
    else:  # max_compression
        if vram_gb >= 3.5:
            return "Q3_K_M — compact, acceptable quality"
        return "Q2_K — last resort; noticeable quality loss"

    return "Q4_K_M — default recommendation"

print(recommend_quantization(6.0))  # Q4_K_M or Q5_K_M
print(recommend_quantization(16.0, "max_quality"))  # Q8_0
print(recommend_quantization(2.5, "max_compression"))  # Q2_K

GPTQ — GPU-optimized quantization

GPTQ uses the Hessian of the loss to find optimal quantization weights, losing much less quality than naive rounding. It targets GPU inference (not CPU like GGUF).

# Load a pre-quantized GPTQ model from the Hub
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Many models have pre-quantized GPTQ versions uploaded by TheBloke
# Example: "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
)

# Load pre-quantized model (no quantization happens at load time)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    device_map="auto",
    quantization_config=gptq_config
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")
print(f"Loaded GPTQ model on: {model.device}")

AWQ — Activation-aware Weight Quantization

AWQ (2023) identifies which weights are most important by examining activations, then quantizes unimportant weights more aggressively. It achieves better quality than GPTQ at 4-bit.

# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load an AWQ-quantized model
model_id = "casperhansen/mistral-7b-instruct-v0.1-awq"
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True,     # Fuse attention layers for faster inference
    trust_remote_code=False,
    safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

# Generate
inputs = tokenizer("Explain cosine similarity:", return_tensors="pt").to(model.device)
with __import__("torch").no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Measuring quality degradation

Compare quantized vs full-precision outputs to understand the real quality gap.

import os
from openai import OpenAI

# Use Ollama to compare quantization levels side-by-side
client_q4 = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

EVAL_PROMPTS = [
    "What is 237 × 48? Show your working.",
    "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
    "Write a Python function that checks if a string is a palindrome.",
    "The user asked: 'How do I learn machine learning?' Give a structured 5-step plan.",
]

def eval_model(model: str, prompts: list[str]) -> list[str]:
    responses = []
    for prompt in prompts:
        resp = client_q4.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=200
        )
        responses.append(resp.choices[0].message.content)
    return responses

# Compare Q4 vs Q8 (if you have both pulled in Ollama)
models_to_test = [
    "llama3.1:8b",          # Default Q4_K_M in Ollama
    "llama3.1:8b-instruct-q8_0",  # Q8 version
]

for model in models_to_test:
    print(f"\n{'='*50}")
    print(f"Model: {model}")
    responses = eval_model(model, EVAL_PROMPTS[:1])  # Test one prompt
    print(f"Response: {responses[0][:200]}...")

Practical quantization guidance

For most production use cases: Q4_K_M is sufficient. Independent benchmarks consistently show < 3% quality degradation on MMLU, HellaSwag, and GSM8k compared to F16. The exceptions are: complex multi-step math (where Q8+ is noticeably better), and code generation with edge cases. If your use case involves either, step up to Q6_K or Q8_0 if you have the VRAM.


Quantization When to use
F16 Fine-tuning, quality-critical production with plenty of VRAM
Q8_0 High-quality production where VRAM allows
Q5_K_M Best balance for 8–12 GB VRAM systems
Q4_K_M Default choice for most use cases
Q3_K_M Only when fitting the model is the priority
Q2_K Avoid — quality loss is usually unacceptable

03-llama-cpp | 05-when-to-run-locally