Quantization¶
Quantization compresses model weights from high-precision floats to lower-precision integers. The goal is the same as image compression: reduce file size with minimal perceptible quality loss. Understanding quantization helps you choose the right format for your hardware and quality requirements.
Learning objectives¶
- Understand the difference between GGUF quantization levels
- Know when to use Q4, Q6, and Q8 formats
- Apply GPTQ and AWQ quantization for GPU inference
- Measure quality degradation across quantization levels
What quantization does¶
A model weight is originally stored as a float32 (4 bytes, 32 bits). Quantization maps these values to integers with fewer bits.
float32 (32 bits): 0.15234375 → exact value
float16 (16 bits): 0.152 → 2× smaller, tiny loss
int8 (8 bits): 24 → 4× smaller, small loss
int4 (4 bits): 1 → 8× smaller, noticeable loss
int2 (2 bits): 0 → 16× smaller, significant loss
The quantization error accumulates across hundreds of layers. At 4-bit, the error is usually small enough that most users can't tell the difference in conversation quality. At 2-bit, quality degrades substantially.
GGUF quantization types¶
llama.cpp and Ollama use GGUF quantization with several variants. The naming follows a pattern:
| Format | Bits | Size (7B model) | Quality | Use case |
|---|---|---|---|---|
| F32 | 32 | 28 GB | Reference | Never — too large |
| F16 | 16 | 14 GB | ≈100% | GPU training/inference |
| Q8_0 | 8 | 7.2 GB | ~99.5% | High quality, have VRAM |
| Q6_K | 6 | 5.5 GB | ~99% | Good quality, less VRAM |
| Q5_K_M | 5 | 4.8 GB | ~98.5% | Balanced |
| Q4_K_M | 4 | 4.1 GB | ~97% | Best default choice |
| Q4_0 | 4 | 3.9 GB | ~96% | Slightly worse than Q4_K_M |
| Q3_K_M | 3 | 3.1 GB | ~94% | Size-constrained only |
| Q2_K | 2 | 2.2 GB | ~88% | Avoid unless necessary |
K-quantization (K_M, K_S, K_L): Uses k-means clustering to find optimal quantization centroids. Better quality than simple quantization at the same bit width. K_M (medium) is the standard choice; K_S is smaller (fewer bits for non-critical layers); K_L is larger.
def recommend_quantization(vram_gb: float, quality_priority: str = "balanced") -> str:
"""Recommend GGUF quantization given available VRAM and quality need."""
if quality_priority == "max_quality":
if vram_gb >= 14:
return "F16 — maximum quality, full float16"
if vram_gb >= 7:
return "Q8_0 — near-lossless, ~99.5% of F16 quality"
if vram_gb >= 5.5:
return "Q6_K — very high quality, ~99%"
elif quality_priority == "balanced":
if vram_gb >= 8:
return "Q5_K_M — excellent balance of size and quality"
if vram_gb >= 4.5:
return "Q4_K_M — best default; ~97% quality, 4.1GB for 7B"
if vram_gb >= 3.5:
return "Q4_0 — slightly less accurate than Q4_K_M"
else: # max_compression
if vram_gb >= 3.5:
return "Q3_K_M — compact, acceptable quality"
return "Q2_K — last resort; noticeable quality loss"
return "Q4_K_M — default recommendation"
print(recommend_quantization(6.0)) # Q4_K_M or Q5_K_M
print(recommend_quantization(16.0, "max_quality")) # Q8_0
print(recommend_quantization(2.5, "max_compression")) # Q2_K
GPTQ — GPU-optimized quantization¶
GPTQ uses the Hessian of the loss to find optimal quantization weights, losing much less quality than naive rounding. It targets GPU inference (not CPU like GGUF).
# Load a pre-quantized GPTQ model from the Hub
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
# Many models have pre-quantized GPTQ versions uploaded by TheBloke
# Example: "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
gptq_config = GPTQConfig(
bits=4,
dataset="c4",
tokenizer=AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
)
# Load pre-quantized model (no quantization happens at load time)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
device_map="auto",
quantization_config=gptq_config
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")
print(f"Loaded GPTQ model on: {model.device}")
AWQ — Activation-aware Weight Quantization¶
AWQ (2023) identifies which weights are most important by examining activations, then quantizes unimportant weights more aggressively. It achieves better quality than GPTQ at 4-bit.
# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Load an AWQ-quantized model
model_id = "casperhansen/mistral-7b-instruct-v0.1-awq"
model = AutoAWQForCausalLM.from_quantized(
model_id,
fuse_layers=True, # Fuse attention layers for faster inference
trust_remote_code=False,
safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
# Generate
inputs = tokenizer("Explain cosine similarity:", return_tensors="pt").to(model.device)
with __import__("torch").no_grad():
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Measuring quality degradation¶
Compare quantized vs full-precision outputs to understand the real quality gap.
import os
from openai import OpenAI
# Use Ollama to compare quantization levels side-by-side
client_q4 = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
EVAL_PROMPTS = [
"What is 237 × 48? Show your working.",
"If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
"Write a Python function that checks if a string is a palindrome.",
"The user asked: 'How do I learn machine learning?' Give a structured 5-step plan.",
]
def eval_model(model: str, prompts: list[str]) -> list[str]:
responses = []
for prompt in prompts:
resp = client_q4.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.0,
max_tokens=200
)
responses.append(resp.choices[0].message.content)
return responses
# Compare Q4 vs Q8 (if you have both pulled in Ollama)
models_to_test = [
"llama3.1:8b", # Default Q4_K_M in Ollama
"llama3.1:8b-instruct-q8_0", # Q8 version
]
for model in models_to_test:
print(f"\n{'='*50}")
print(f"Model: {model}")
responses = eval_model(model, EVAL_PROMPTS[:1]) # Test one prompt
print(f"Response: {responses[0][:200]}...")
Practical quantization guidance
For most production use cases: Q4_K_M is sufficient. Independent benchmarks consistently show < 3% quality degradation on MMLU, HellaSwag, and GSM8k compared to F16. The exceptions are: complex multi-step math (where Q8+ is noticeably better), and code generation with edge cases. If your use case involves either, step up to Q6_K or Q8_0 if you have the VRAM.
| Quantization | When to use |
|---|---|
| F16 | Fine-tuning, quality-critical production with plenty of VRAM |
| Q8_0 | High-quality production where VRAM allows |
| Q5_K_M | Best balance for 8–12 GB VRAM systems |
| Q4_K_M | Default choice for most use cases |
| Q3_K_M | Only when fitting the model is the priority |
| Q2_K | Avoid — quality loss is usually unacceptable |