Evals are the measurement layer for AI systems. They help teams understand whether a candidate model, agent, prompt, workflow, or retrieval system behaves well enough for the job it is meant to do.
Unlike conventional software tests, evals often have to measure behavior over unstructured inputs and probabilistic outputs. That creates an important distinction: do not aim for or expect 100% test coverage with evals. If you do, you are likely overfitting to a narrow set of cases.
Instead, aim for performance at or above human level, as measured by expert-grounded judges.
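As a rough illustration of that target, the sketch below compares a candidate's pass rate with a human baseline scored on the same cases. Everything in it is an assumption made for illustration: the `Verdict` type, the `pass_rate` and `meets_human_baseline` helpers, the `tolerance` parameter, and the sample verdicts are hypothetical, and the pass/fail verdicts are assumed to come from an expert-grounded judge that is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judged case (hypothetical type, for illustration only)."""
    case_id: str
    passed: bool


def pass_rate(verdicts: list[Verdict]) -> float:
    """Fraction of cases the judge marked as passing."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v.passed for v in verdicts) / len(verdicts)


def meets_human_baseline(
    candidate: list[Verdict],
    human: list[Verdict],
    tolerance: float = 0.0,
) -> bool:
    """True if the candidate scores at or above the human baseline.

    `tolerance` is an assumed knob for accepting small shortfalls;
    it is not something the text above prescribes.
    """
    return pass_rate(candidate) >= pass_rate(human) - tolerance


# Made-up verdicts on the same five cases (illustrative data only).
candidate_verdicts = [
    Verdict("c1", True), Verdict("c2", True), Verdict("c3", False),
    Verdict("c4", True), Verdict("c5", True),
]
human_verdicts = [
    Verdict("c1", True), Verdict("c2", False), Verdict("c3", True),
    Verdict("c4", False), Verdict("c5", True),
]

candidate_ok = meets_human_baseline(candidate_verdicts, human_verdicts)
```

Note that the candidate clears the bar without passing every case, which is the point of the distinction above: the comparison is against a human reference, not against a perfect score.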
Pages in This Section
- Evals as Outer Loop: how evals fit into AI development and post-deployment monitoring.
- Eval Patterns: common measurement approaches and how they fit together.
- Where to Start: how to choose the first eval that can change a decision.
- Static Evals vs. Judges: why static evals and LLM judges solve different parts of the AI measurement problem.