Human Evaluations

Automated metrics tell you whether the model is faithful and relevant. Human evaluation tells you whether users actually find the output helpful. Both matter — and calibrated human evals are harder to set up than most teams realize.

Learning objectives

Design a rubric-based human evaluation protocol
Implement pairwise preference evaluation (A/B comparison)
Measure and improve inter-rater agreement using Cohen's kappa
Build a lightweight human eval pipeline in Python

Why you still need humans

LLM-as-judge is powerful but has known failure modes:

Verbosity bias: longer answers score higher even when less accurate
Self-preference: GPT-4o rates GPT-4o outputs higher; Claude rates Claude outputs higher
Position bias: the order in which options are presented influences the verdict
Subtle quality: helpfulness, trust, and tone require lived human experience
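
Order effects are straightforward to measure before you trust a judge. The sketch below judges every pair twice, once in each order, and reports how often the verdict survives the swap. `position_consistency` and the `judge` callable are hypothetical names used for illustration; `judge` stands in for whatever LLM or human rater you use and is assumed to return "A", "B", or "tie".

```python
from typing import Callable

# Hypothetical signature: judge(first_answer, second_answer) -> "A", "B", or "tie"
Judge = Callable[[str, str], str]

def position_consistency(pairs: list[tuple[str, str]], judge: Judge) -> float:
    """Fraction of pairs whose verdict is unchanged when the answers swap places."""
    if not pairs:
        raise ValueError("At least one pair is required")
    flip = {"A": "B", "B": "A", "tie": "tie"}
    consistent = 0
    for first, second in pairs:
        original = judge(first, second)
        # Judge again in reverse order, then map the verdict back to the original labels
        reversed_verdict = flip[judge(second, first)]
        consistent += original == reversed_verdict
    return consistent / len(pairs)

# Toy judges for illustration only
def always_first(a: str, b: str) -> str:
    return "A"  # pure position bias: whatever is shown first wins

def longer_wins(a: str, b: str) -> str:
    return "A" if len(a) > len(b) else "B"  # verdict depends on content, not order

pairs = [("short", "a much longer answer"), ("detailed reply", "ok")]
print(position_consistency(pairs, always_first))  # 0.0
print(position_consistency(pairs, longer_wins))   # 1.0
```

A score well below 1.0 suggests the judge is reacting to position rather than content.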

"Automated metrics are a proxy for the thing you actually care about, which is whether users succeed at their task." — Liang et al., HELM benchmark paper

Human evaluation is the ground truth. Automated metrics are approximations of it.

Rubric design

A bad rubric: "Rate the answer 1–5." A good rubric: specific dimensions with anchored descriptions.

from dataclasses import dataclass, field
from typing import Optional
import json

@dataclass
class RubricDimension:
    name: str
    description: str
    anchors: dict[int, str]  # score → description

# Production-grade rubric for a RAG Q&A system
QA_RUBRIC = [
    RubricDimension(
        name="correctness",
        description="Is the factual content of the answer accurate?",
        anchors={
            1: "Contains significant factual errors",
            2: "Mostly incorrect or misleading",
            3: "Partially correct with notable gaps",
            4: "Mostly correct, minor inaccuracies",
            5: "Fully accurate and verifiable"
        }
    ),
    RubricDimension(
        name="completeness",
        description="Does the answer cover all aspects of the question?",
        anchors={
            1: "Misses the main point of the question",
            2: "Addresses only one part of a multi-part question",
            3: "Covers the main point, misses secondary aspects",
            4: "Covers most aspects, minor omissions",
            5: "Complete answer addressing all aspects"
        }
    ),
    RubricDimension(
        name="clarity",
        description="Is the answer easy to understand and well-structured?",
        anchors={
            1: "Confusing, contradictory, or unreadable",
            2: "Hard to follow, poor organization",
            3: "Clear but could be better structured",
            4: "Clear and well-organized",
            5: "Exceptionally clear, easy to skim and act on"
        }
    ),
    RubricDimension(
        name="citation_quality",
        description="Are sources cited appropriately and accurately?",
        anchors={
            1: "No citations where needed, or all citations wrong",
            2: "Citations present but mostly irrelevant",
            3: "Some correct citations, some missing",
            4: "Good citations, minor issues",
            5: "All claims backed by accurate, specific citations"
        }
    ),
]

def format_rubric_for_annotator(rubric: list[RubricDimension]) -> str:
    lines = ["## Scoring rubric\n"]
    for dim in rubric:
        lines.append(f"### {dim.name.replace('_', ' ').title()}")
        lines.append(f"{dim.description}\n")
        for score, desc in sorted(dim.anchors.items()):
            lines.append(f"- **{score}** — {desc}")
        lines.append("")
    return "\n".join(lines)

print(format_rubric_for_annotator(QA_RUBRIC[:2]))
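
Anchored scores only help if the values raters submit actually match the anchors. The following is a minimal sketch of a check you could run on each returned annotation. `validate_scores` is a hypothetical helper, and it takes a plain mapping of dimension name to anchors, which could be built from the rubric above with `{d.name: d.anchors for d in QA_RUBRIC}`.

```python
def validate_scores(
    scores: dict[str, int],
    anchors_by_dimension: dict[str, dict[int, str]],
) -> list[str]:
    """Return a list of problems; an empty list means the annotation is usable."""
    problems = []
    for name, anchors in anchors_by_dimension.items():
        if name not in scores:
            problems.append(f"{name}: missing score")
        elif scores[name] not in anchors:
            problems.append(f"{name}: {scores[name]!r} has no anchor")
    # Flag dimensions the rater invented or misspelled
    for name in scores:
        if name not in anchors_by_dimension:
            problems.append(f"{name}: not a rubric dimension")
    return problems

# Stand-in rubric with placeholder anchor text for scores 1 to 5
example_anchors = {
    "correctness": dict.fromkeys(range(1, 6), "anchor text"),
    "clarity": dict.fromkeys(range(1, 6), "anchor text"),
}
print(validate_scores({"correctness": 4, "clarity": 7}, example_anchors))
# ['clarity: 7 has no anchor']
```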

Pairwise preference evaluation

Absolute scores are noisy because raters struggle to keep a 1–5 scale calibrated across sessions. Pairwise comparisons ("which is better, A or B?") are usually more consistent between raters and make small quality differences between two systems easier to detect.

@dataclass
class PairwiseTask:
    question: str
    answer_a: str
    answer_b: str
    context: Optional[str] = None
    metadata: dict = field(default_factory=dict)

@dataclass
class PairwiseResult:
    task_id: str
    rater_id: str
    preference: str        # "A", "B", or "tie"
    confidence: int        # 1 (slight) to 3 (clear)
    reasoning: str
    dimension_scores: dict[str, dict[str, int]]  # {"A": {"correctness": 4}, "B": {...}}

def create_pairwise_prompt(task: PairwiseTask) -> str:
    context_section = f"\n\nContext available to the assistant:\n{task.context}" if task.context else ""
    return f"""Compare these two answers to the same question. Evaluate them on correctness, completeness, and clarity.

Question: {task.question}{context_section}

Answer A:
{task.answer_a}

Answer B:
{task.answer_b}

Which answer is better overall? Respond with:
- preference: "A", "B", or "tie"
- confidence: 1 (slight preference) to 3 (clearly better)
- reasoning: one sentence
- dimension_scores: {{"A": {{"correctness": 1-5, "completeness": 1-5, "clarity": 1-5}}, "B": {{...}}}}

Return JSON only."""

def llm_pairwise_judge(task: PairwiseTask, task_id: str) -> PairwiseResult:
    # Imports are local to keep the snippet self-contained; move them to module level in real code
    import os
    import random
    from openai import OpenAI

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # Randomize order to reduce position bias: swap A and B 50% of the time
    swapped = random.random() > 0.5
    if swapped:
        task_for_eval = PairwiseTask(
            question=task.question,
            answer_a=task.answer_b,
            answer_b=task.answer_a,
            context=task.context,
            metadata=task.metadata,
        )
    else:
        task_for_eval = task

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": create_pairwise_prompt(task_for_eval)}],
        temperature=0.0,
        max_tokens=300,
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)

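    # Un-swap if needed so the verdict refers to the caller's original A and B
    if swapped:
        if result["preference"] in ("A", "B"):
            result["preference"] = "B" if result["preference"] == "A" else "A"
        # Per-answer dimension scores were produced for the swapped labels, so map them back too
        scores = result.get("dimension_scores") or {}
        if "A" in scores or "B" in scores:
            result["dimension_scores"] = {"A": scores.get("B", {}), "B": scores.get("A", {})}
        # Note: the free-text reasoning may still mention the swapped labels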

    return PairwiseResult(
        task_id=task_id,
        rater_id="gpt-4o",
        preference=result["preference"],
        confidence=result["confidence"],
        reasoning=result["reasoning"],
        dimension_scores=result.get("dimension_scores", {})
    )
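
Once verdicts are collected, whether from human raters or from a judge like the one above, they still have to be turned into a decision. A minimal sketch, assuming verdicts are the strings "A", "B", and "tie": it reports A's win rate over decisive comparisons and a two-sided exact sign test, which asks how likely a split this lopsided would be if raters had no real preference. `summarize_preferences` is a hypothetical helper, and dropping ties before testing is one common convention rather than the only option.

```python
from collections import Counter
from math import comb

def summarize_preferences(preferences: list[str]) -> dict:
    """Win rate for A and a two-sided sign-test p-value, ignoring ties."""
    counts = Counter(preferences)
    wins_a, wins_b, ties = counts["A"], counts["B"], counts["tie"]
    decisive = wins_a + wins_b
    if decisive == 0:
        return {"wins_a": 0, "wins_b": 0, "ties": ties, "win_rate_a": None, "p_value": 1.0}

    # Probability, under a fair coin, of the larger side winning at least this often
    larger = max(wins_a, wins_b)
    tail = sum(comb(decisive, k) for k in range(larger, decisive + 1)) / 2 ** decisive
    p_value = min(1.0, 2 * tail)  # two-sided, capped at 1

    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": ties,
        "win_rate_a": wins_a / decisive,
        "p_value": p_value,
    }

verdicts = ["A"] * 8 + ["B"] * 2 + ["tie"] * 2
summary = summarize_preferences(verdicts)
print(summary["win_rate_a"], summary["p_value"])
# 0.8 0.109375
```

Eight wins out of ten decisive comparisons looks convincing, yet the p-value of about 0.11 shows that ten comparisons are too few to rule out chance.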

Inter-rater agreement

When multiple humans (or LLM judges) rate the same examples, they will disagree. Inter-rater agreement measures how often they agree — and whether agreement is better than chance.

from collections import Counter

def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """
    Cohen's kappa for two raters on the same items.
    kappa = (P_o - P_e) / (1 - P_e)
    where P_o = observed agreement, P_e = expected agreement by chance.
    """
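    # Raise instead of assert so the checks still run under `python -O`
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both raters must score the same items")
    if not ratings_a:
        raise ValueError("At least one rated item is required")
    n = len(ratings_a)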
    categories = sorted(set(ratings_a) | set(ratings_b))

    # Observed agreement
    p_observed = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n

    # Expected agreement by chance
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_expected = sum(
        (counts_a.get(c, 0) / n) * (counts_b.get(c, 0) / n)
        for c in categories
    )

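    if p_expected == 1.0:
        # Both raters used one identical label for every item, so kappa is 0/0;
        # returning 1.0 here is a convention, not a computed value
        return 1.0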

    return (p_observed - p_expected) / (1 - p_expected)

def interpret_kappa(kappa: float) -> str:
    if kappa < 0:
        return "Worse than chance"
    if kappa < 0.20:
        return "Slight agreement"
    if kappa < 0.40:
        return "Fair agreement"
    if kappa < 0.60:
        return "Moderate agreement"
    if kappa < 0.80:
        return "Substantial agreement"
    return "Almost perfect agreement"

# Simulate two raters scoring 10 answers on correctness (1-5)
rater_1 = [5, 4, 3, 5, 2, 4, 3, 5, 4, 3]
rater_2 = [5, 4, 2, 5, 2, 3, 3, 4, 4, 3]

kappa = cohens_kappa(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.3f} ({interpret_kappa(kappa)})")
# Cohen's kappa: 0.595 (Moderate agreement)
# (7 of 10 items match, so P_o = 0.70; P_e = 0.26; kappa = 0.44 / 0.74)

# Target: kappa > 0.6 before using raters for production evaluation
# If kappa < 0.4: run calibration session — raters review disagreements together
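
Plain Cohen's kappa treats a 5-versus-4 disagreement exactly like a 5-versus-1 disagreement. For ordinal rubric scores, a weighted kappa that penalizes larger gaps more heavily is often a better fit. The sketch below uses quadratic weights. `quadratic_weighted_kappa` is a hypothetical helper written for illustration; if scikit-learn is available, `cohen_kappa_score(a, b, weights="quadratic")` is an alternative.

```python
from collections import Counter

def quadratic_weighted_kappa(ratings_a: list[int], ratings_b: list[int]) -> float:
    """Weighted kappa for ordinal scores: 1 - observed / expected squared disagreement."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must score the same non-empty set of items")
    n = len(ratings_a)

    # Mean squared gap actually observed between the two raters
    observed = sum((a - b) ** 2 for a, b in zip(ratings_a, ratings_b)) / n

    # Mean squared gap expected if each rater kept their score distribution
    # but assigned scores independently of the other rater
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        counts_a[i] * counts_b[j] * (i - j) ** 2
        for i in counts_a
        for j in counts_b
    ) / (n * n)

    if expected == 0:
        return 1.0  # both raters gave every item the same single score
    return 1.0 - observed / expected

rater_1 = [5, 4, 3, 5, 2, 4, 3, 5, 4, 3]
rater_2 = [5, 4, 2, 5, 2, 3, 3, 4, 4, 3]
print(f"{quadratic_weighted_kappa(rater_1, rater_2):.3f}")
# 0.857
```

Every disagreement in this sample is a single point apart, so the weighted score is high. The normalizing constant of the quadratic weights cancels between numerator and denominator, which is why it does not appear in the code.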

Calibration beats training

The fastest way to improve inter-rater agreement is a 30-minute calibration session where raters score the same 10 examples independently, then discuss disagreements. Calibrated raters reach kappa > 0.7 much faster than raters who read a rubric and start scoring.
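
To prepare the discussion part of a calibration session, it helps to list the items where raters diverged, largest gap first. A minimal sketch, assuming each rater's scores are in the same item order; `disagreement_report` and its `min_gap` parameter are hypothetical names.

```python
def disagreement_report(
    item_ids: list[str],
    ratings_a: list[int],
    ratings_b: list[int],
    min_gap: int = 1,
) -> list[tuple[str, int, int, int]]:
    """Return (item_id, score_a, score_b, gap) rows, sorted by gap descending."""
    if not (len(item_ids) == len(ratings_a) == len(ratings_b)):
        raise ValueError("item_ids and both rating lists must have the same length")
    rows = [
        (item_id, a, b, abs(a - b))
        for item_id, a, b in zip(item_ids, ratings_a, ratings_b)
        if abs(a - b) >= min_gap
    ]
    # sorted() is stable, so items with equal gaps keep their original order
    return sorted(rows, key=lambda row: row[3], reverse=True)

report = disagreement_report(["q1", "q2", "q3", "q4"], [5, 4, 2, 3], [5, 2, 3, 3])
for item_id, a, b, gap in report:
    print(f"{item_id}: rater A={a}, rater B={b}, gap={gap}")
# q2: rater A=4, rater B=2, gap=2
# q3: rater A=2, rater B=3, gap=1
```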

Building a lightweight human eval pipeline

import csv
import os
from datetime import datetime

class HumanEvalPipeline:
    """
    Generates annotation tasks as CSV for human raters.
    Collects results and computes aggregate scores.
    """

    def __init__(self, output_dir: str = "eval_tasks"):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

    def generate_tasks(
        self,
        eval_examples: list[dict],
        dimensions: list[str]
    ) -> str:
        """Write annotation task CSV. One row per example."""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = os.path.join(self.output_dir, f"tasks_{timestamp}.csv")

        fieldnames = ["id", "question", "answer", "context"] + dimensions + ["notes"]

        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for ex in eval_examples:
                row = {
                    "id": ex["id"],
                    "question": ex["question"],
                    "answer": ex["answer"],
                    "context": ex.get("context", ""),
                }
                row.update({dim: "" for dim in dimensions})
                row["notes"] = ""
                writer.writerow(row)

        print(f"Tasks written to: {path}")
        return path

    def load_results(self, results_path: str) -> list[dict]:
        with open(results_path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    def aggregate_results(
        self,
        results: list[dict],
        dimensions: list[str]
    ) -> dict:
        scores = {dim: [] for dim in dimensions}

        for row in results:
            for dim in dimensions:
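                # DictReader yields None for columns missing from a short row
                val = (row.get(dim) or "").strip()
                # Count only whole-number scores on the 1-5 scale; skip blanks and typos
                if val.isdigit() and 1 <= int(val) <= 5:
                    scores[dim].append(int(val))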

        summary = {}
        for dim, vals in scores.items():
            if vals:
                summary[dim] = {
                    "mean": sum(vals) / len(vals),
                    "n": len(vals),
                    "pass_rate": sum(1 for v in vals if v >= 4) / len(vals)
                }

        return summary

# Usage
pipeline = HumanEvalPipeline(output_dir="human_evals")

examples = [
    {"id": "q1", "question": "What is our return policy?",
     "answer": "Returns are accepted within 30 days.", "context": "Policy doc text..."},
    {"id": "q2", "question": "How do I cancel my subscription?",
     "answer": "Go to Settings → Subscription → Cancel.", "context": "Help center text..."},
]

task_path = pipeline.generate_tasks(
    examples,
    dimensions=["correctness", "completeness", "clarity"]
)
# Raters open the CSV, fill in scores 1-5, save

# After collection:
# results = pipeline.load_results("human_evals/tasks_20240101_120000_filled.csv")
# summary = pipeline.aggregate_results(results, ["correctness", "completeness", "clarity"])
# print(summary)
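
`aggregate_results` reports `pass_rate` as a single number, which hides how little evidence a small annotation batch provides. A minimal sketch of a Wilson score interval for a pass rate follows. `wilson_interval` is a hypothetical helper, and the default `z = 1.96` assumes a 95% confidence level.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (here, the share of scores >= 4)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("Need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

low, high = wilson_interval(passes=8, n=10)
print(f"pass rate 0.80, 95% interval {low:.2f} to {high:.2f}")
# pass rate 0.80, 95% interval 0.49 to 0.94
```

With ten rated examples, an 80% pass rate is compatible with anything from roughly half to nearly all answers passing, which is a reason to annotate more items before comparing two prompt versions.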

When to use human vs automated evaluation

Signal	Use automated eval	Use human eval
Correctness (objective)	✓ Reference-based metrics	✓ Spot check
Faithfulness	✓ NLI, LLM-as-judge	✓ For high-stakes content
Helpfulness	✗ Hard to automate	✓ Required
Tone / trust	✗ Unreliable	✓ Required
Regression detection	✓ Fast automated suite	✗ Too slow
New capability launch	✗ No baseline	✓ Required
A/B comparison	✓ Pairwise LLM-judge	✓ Ground truth

Practical rule

Automate 90% of your evaluation. Use human eval for every major prompt change, every new feature launch, and a random 1% sample of production traffic. This coverage costs ~$50–200/month in annotation time and catches the things automated metrics miss.
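
One way to draw the 1% production sample is to hash each request ID, so the decision is reproducible and needs no shared state between servers. This is a sketch under assumptions: `in_human_eval_sample`, the `request_id` format, and the `salt` value are all hypothetical, and your logging setup may call for a different approach.

```python
import hashlib

def in_human_eval_sample(
    request_id: str,
    rate: float = 0.01,
    salt: str = "human-eval-v1",  # hypothetical; change it to draw a fresh sample
) -> bool:
    """Deterministically select about `rate` of all request IDs."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    # Integer comparison avoids float rounding at the edges of the range
    return bucket < int(rate * 2**64)

sampled = [rid for rid in (f"req-{i}" for i in range(20_000)) if in_human_eval_sample(rid)]
print(len(sampled))  # roughly 200; the exact count depends on the hash values
```

Because the same ID always gets the same answer, a request can be checked at logging time and again at export time without storing a flag.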
