Eval Approaches

Common AI eval approaches and the role each one plays in measurement.

Most eval systems combine multiple approaches:

  • Static eval sets: curated examples that are run repeatedly to catch regressions and compare candidate changes.
  • LLM judges: model-based evaluators that map unstructured outputs into bounded labels, rationales, or classifications.
  • Human annotation: expert review used to ground judge behavior, audit model performance, and build trust in measurements.
  • Production sampling: real traces or records sampled from live usage to discover new failure modes and measure field behavior.
  • Operational metrics: latency, cost, refusal rate, escalation rate, completion rate, and other system-level signals.
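As a rough illustration of how a few of the approaches above might look in code, the sketch below runs a static eval set against two versions of a system, draws a reproducible sample of production traces, and computes one operational metric. Everything in it is hypothetical: the EvalCase type, the exact-match scoring rule, the toy baseline and candidate functions, and the trace fields ("output", "refused") are assumptions made for the example, not part of any particular eval framework.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One curated example: an input and the output we expect (hypothetical schema)."""
    prompt: str
    expected: str


def run_static_eval(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output exactly matches the expected string."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if system(case.prompt).strip() == case.expected)
    return passed / len(cases)


def sample_traces(traces: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible uniform sample of production traces for review."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def refusal_rate(traces: list[dict]) -> float:
    """Operational metric: share of traces in which the system refused."""
    if not traces:
        return 0.0
    return sum(1 for trace in traces if trace["refused"]) / len(traces)


# Toy stand-ins for a real model: lookup tables keyed by prompt.
EVAL_SET = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),
    EvalCase("opposite of hot", "cold"),
]


def baseline(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9", "opposite of hot": "cold"}
    return answers.get(prompt, "")


def candidate(prompt: str) -> str:
    # The candidate change breaks one case that the baseline handled.
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "6", "opposite of hot": "cold"}
    return answers.get(prompt, "")


# Hypothetical production traces; real ones would carry far more fields.
TRACES = [
    {"output": "Paris", "refused": False},
    {"output": "I can't help with that.", "refused": True},
    {"output": "4", "refused": False},
    {"output": "cold", "refused": False},
]

baseline_score = run_static_eval(baseline, EVAL_SET)    # 1.0
candidate_score = run_static_eval(candidate, EVAL_SET)  # 0.75
regressed = candidate_score < baseline_score            # True: the fixed set catches the regression
```

Because the eval set and the sampling seed are fixed, rerunning the script yields the same scores and the same sample, which is the repeatability property the static approach is meant to provide. Exact match is only one possible scoring rule; open-ended outputs generally need a judge or a human reviewer instead.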

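No single eval method is sufficient on its own. Static eval sets provide repeatability, LLM judges provide scale over open-ended outputs, and human annotation provides grounding. Production sampling surfaces failure modes that curated sets miss, and operational metrics track system-level behavior.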
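One way to make the grounding relationship concrete is to measure how often a judge agrees with human labels on the same outputs. The sketch below is a minimal example under stated assumptions: stub_judge is a hypothetical keyword rule standing in for a real LLM judge, the outputs and human labels are invented for illustration, and Cohen's kappa is used as one common chance-corrected agreement statistic, not as the only valid choice.

```python
from collections import Counter


def stub_judge(output: str) -> str:
    """Hypothetical stand-in for an LLM judge: maps free text to a bounded label."""
    return "fail" if "i don't know" in output.lower() else "pass"


def raw_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which the two label lists agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    observed = raw_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented outputs and human labels, for illustration only.
outputs = [
    "Paris is the capital.",
    "I don't know.",
    "The answer is 5.",            # wrong answer that the keyword rule cannot detect
    "I don't know the answer.",
    "Water boils at 100 C at sea level.",
    "Berlin.",
]
human_labels = ["pass", "fail", "fail", "fail", "pass", "pass"]

judge_labels = [stub_judge(text) for text in outputs]
agreement = raw_agreement(judge_labels, human_labels)  # 5 of 6 items match
kappa = cohen_kappa(judge_labels, human_labels)        # about 0.67
```

The one disagreement is the confidently wrong answer, which the stand-in judge passes and the human reviewer fails. Disagreements of this kind are what human annotation is meant to expose before judge scores are trusted at scale.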