Interview Questions — LLM Evaluation¶
Q1: What is RAGAS and which of its metrics require ground truth labels?¶
Show answer
RAGAS (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG pipelines. It provides four core metrics:
No ground truth required: - Faithfulness — uses the LLM to check whether each claim in the answer is entailed by the retrieved context. No reference answer needed. - Answer Relevancy — generates back-questions from the answer and measures similarity to the original question via embeddings. No reference answer needed.
Ground truth required: - Context Precision — measures what fraction of the retrieved chunks are relevant to answering the question. Needs a reference answer to judge relevance. - Context Recall — measures what fraction of the ground truth answer is covered by retrieved context. Needs the reference answer to decompose into claims.
The practical implication: you can run faithfulness and answer relevancy at scale without annotation effort. Context precision and recall require a labeled eval set.
Q2: What is the difference between faithfulness and factual accuracy?¶
Show answer
These are often confused but measure different things:
Faithfulness measures whether the answer adheres to the provided context. A faithful answer contains only information that can be found in the retrieved documents.
Factual accuracy measures whether the answer is actually true in the real world.
A faithful answer can be factually wrong — if the retrieved context contains incorrect information, a faithful answer will faithfully reproduce that incorrect information.
Conversely, an unfaithful answer can be factually correct — the model might hallucinate a true fact from its parametric memory (e.g., adding "Python was released in 1991" when the context doesn't mention it).
In RAG systems, you want both: faithfulness ensures the model doesn't add unsupported claims; factual accuracy requires the underlying documents to be correct. You control faithfulness with prompt engineering and grounding instructions; factual accuracy depends on your data quality.
Q3: Explain the difference between context precision and context recall. What causes each to be low?¶
Show answer
Context precision = of the chunks you retrieved, how many are actually relevant? - Low precision: you're retrieving noise — irrelevant chunks that dilute the good content - Root cause: poor embedding model, too-large k, chunking that mixes unrelated topics - Fix: better embedding model, reduce k, add reranking, improve chunking
Context recall = of the ground truth answer's content, how much is covered by the retrieved chunks? - Low recall: you're missing the documents that contain the answer - Root cause: relevant documents not retrieved — poor embedding model, poor chunking, documents not in index - Fix: better chunking, add query expansion, hybrid search, check your ingestion pipeline
Think of it with a precision/recall analogy from classification: - Precision: what fraction of what you returned was correct? - Recall: what fraction of what was correct did you return?
A system with high precision, low recall returns few but relevant chunks — it's conservative. A system with high recall, low precision returns many chunks including relevant ones — it's noisy.
Q4: What biases affect LLM-as-judge evaluation, and how do you mitigate them?¶
Show answer
Four well-documented biases:
Verbosity bias: Longer, more detailed answers score higher regardless of quality. Mitigation: evaluate each dimension separately (correctness vs. completeness vs. clarity); never ask for a single "quality" score.
Position/order bias: When comparing two answers (A vs B), the first tends to win. Mitigation: run each comparison twice with A and B swapped; take the average or flag disagreements.
Self-preference: GPT-4o rates GPT-4o outputs higher; Claude rates Claude outputs higher. Mitigation: use a different model as judge than the one being evaluated; use multiple judges.
Anchoring: Seeing a bad example first makes subsequent examples seem better in comparison. Mitigation: randomize eval set order; use absolute rubrics rather than relative comparisons when anchoring is a concern.
For production systems, validate your LLM judge on a human-labeled subset. If the judge's rankings correlate with human rankings (Spearman ρ > 0.7), it's trustworthy at scale.
Q5: What is Cohen's kappa and when would you use it in an LLM evaluation context?¶
Show answer
Cohen's kappa is a measure of inter-rater agreement that corrects for chance:
κ = (P_observed - P_expected) / (1 - P_expected)
Where: - P_observed = fraction of items where both raters agreed - P_expected = agreement expected by random chance, given the distribution of ratings
Why correct for chance? If one rater gives "good" 80% of the time and another gives "good" 80% of the time, they'll agree ~64% of the time purely by chance even if their judgments are uncorrelated.
Interpretation: - κ < 0.40: Poor — don't use these raters together; run calibration - 0.40–0.60: Moderate — acceptable for exploratory work - 0.60–0.80: Substantial — suitable for production annotation - κ > 0.80: Near-perfect — well-calibrated team
When to use: Before using human annotators for production eval, measure kappa on a calibration set of ~30 examples. If kappa < 0.60, run a calibration session where raters discuss their disagreements before continuing.
You can also measure kappa between an LLM judge and human raters to validate whether the LLM judge is trustworthy.
Q6: How would you design an evaluation pipeline for a new RAG feature before shipping it?¶
Show answer
A practical pre-ship eval pipeline has three stages:
Stage 1 — Automated regression suite (runs in CI) - 50–100 labeled examples covering the feature's scope - RAGAS faithfulness and answer relevancy scored automatically - Retrieval Recall@5 on labeled relevant documents - Must pass: faithfulness > 0.80, answer relevancy > 0.75, Recall@5 > 0.85 - Runtime: < 5 minutes
Stage 2 — LLM-as-judge on broader eval set (runs pre-merge) - 200–500 examples including edge cases and adversarial inputs - GPT-4o judges correctness, completeness, and citation quality (1–5 rubric) - Compares scores vs the previous version (regression detection) - Must pass: no statistically significant regression on any dimension
Stage 3 — Human spot check (runs pre-launch) - 20–30 examples selected from the hardest cases and new capability demonstrations - 2 internal raters score each example on the same rubric as Stage 2 - Must pass: average human score ≥ 4.0 on correctness, kappa > 0.60 - This is the gate that the automated pipeline can't replace
Output: a one-page eval report summarizing all three stages, linked in the PR.
Q7: What is Recall@k and MRR, and how do they differ?¶
Show answer
Both evaluate retrieval quality given labeled relevant documents.
Recall@k = what fraction of all relevant documents appear in the top k retrieved results? - Answers: "Does the retriever find the relevant documents?" - Example: if there are 3 relevant docs and 2 appear in the top 5, Recall@5 = 0.67 - Best for: systems where you need all relevant documents (comprehensive recall matters)
MRR (Mean Reciprocal Rank) = average of 1/rank_of_first_relevant_doc across queries - Answers: "How high does the first relevant document appear?" - Example: if the first relevant doc is at rank 2, RR = 0.5. Averaged across queries = MRR - Best for: systems where users look at the top result and move on (position matters most)
Key difference: Recall@k cares about finding all relevant docs; MRR cares about finding at least one relevant doc as high as possible.
For RAG: Recall@k is usually more important because missing a relevant document means the LLM won't have access to it. A high MRR with low recall means the best answer ranks first but other important context is missing.
06-practice-exercises | ../Day-04-Part-2-Responsible-AI-and-Safety/00-agenda