Judges | Sutro Handbook

If you are building in AI, you have likely come across plenty of content and products about evals. However, there is little useful information about LLM-as-a-judge, the technique that underpins many, if not most, modern evals.

LLM judges are not just for evals, though. AI models that make judgment calls, meaning decisions that would otherwise be handled by a human, unlock a massive amount of utility because they run at a cost and scale that human reviewers cannot match. Judges are only useful, however, when their decisions are as good as or better than those of the humans they stand in for.

This section covers what judges are, where they are commonly used, and some task-design principles for using them effectively.

In This Section

Judge Terminology
The core vocabulary used when discussing LLM judges and candidate models.

What's in a Judge?
The model, context, input, and output schema that make up a typical LLM judge.

Types of Judges
Reliability, quality, sentiment, and intent judges in the judge-design hierarchy.

Judges in Evals: Flip Your Intuition
First-principles responses to common objections about using LLMs to judge LLMs.

Good Task Design Is All You Need
The design knobs that make LLM judges more reliable, measurable, and useful.