Agenda: LLM Evaluation
Shipping an LLM feature without evaluation is shipping blind. You need to measure what you can't see: faithfulness, relevance, hallucination rate, and whether your prompt changes actually improved anything.
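To make "did the prompt change actually improve anything" concrete, here is a minimal sketch of a paired comparison. It assumes you already have one score per test example for each prompt version (for instance a faithfulness score in [0, 1]). The function name `compare_prompt_versions` and the sample scores are hypothetical and exist only for illustration; they do not come from RAGAS or any other library.

```python
from statistics import mean


def compare_prompt_versions(baseline_scores, candidate_scores):
    """Compare per-example scores for two prompt versions on the same test set.

    Both lists must be aligned: index i in each list refers to the same example.
    """
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("score lists must cover the same examples")
    if not baseline_scores:
        raise ValueError("need at least one example")

    # Per-example difference: positive means the candidate prompt scored higher.
    deltas = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    return {
        "baseline_mean": mean(baseline_scores),
        "candidate_mean": mean(candidate_scores),
        "mean_delta": mean(deltas),
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
    }


# Hypothetical scores for four test examples (illustrative values only).
baseline = [0.5, 0.75, 1.0, 0.25]
candidate = [0.75, 0.75, 1.0, 0.5]
report = compare_prompt_versions(baseline, candidate)
print(report)
```

Looking at wins and losses alongside the mean helps spot a change that raises the average while regressing on specific examples. With a set this small the difference is only suggestive; a real comparison would need more examples and some estimate of noise.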
Learning objectives
By the end of this session you will be able to:
- Design an evaluation strategy for LLM outputs
- Use RAGAS to evaluate RAG pipeline quality
- Implement LLM-as-judge for automated scoring
- Build a human evaluation framework with calibrated rubrics
Schedule
| Time | Topic | File |
|---|---|---|
| 0:00 – 0:25 | Evaluation overview: why evaluate, evaluation types, evaluation design | 01-evaluation-overview |
| 0:25 – 1:00 | RAGAS framework: faithfulness, relevance, recall | 02-ragas-framework |
| 1:00 – 1:30 | Hallucination and faithfulness metrics | 03-hallucination-and-faithfulness |
| 1:30 – 2:00 | Relevance metrics: answer, context, query | 04-relevance-metrics |
| 2:00 – 2:30 | Human evaluations: rubrics, inter-rater agreement, calibration | 05-human-evals |
| 2:30 – 3:00 | Practice exercises | 06-practice-exercises |