The last section probably reminded you of how much control you give up when building AI systems. But as a good engineer, your job is to design systems around what you can control and to mitigate the effects of known unknowns.
Fortunately, you have a lot of control over judge design decisions. At Sutro, we refer to this as task design. Many of these principles can be reused across the rest of the primitives.
Your available knobs are:
| Knob | Question | Example |
|---|---|---|
| Atomicity | Can the task be decomposed, allowing many judges to evaluate small components of the result rather than one judge evaluating the entire result? | Instead of "Is this result good?", ask "Did the response include a URL for the user if one was requested?" and "Did the model produce clear user instructions?" |
| Structure | Can the task be reduced to a binary or three-class outcome? | pass / fail; true / false / insufficient_evidence. |
| Specificity | Can each task be defined extremely clearly, such that any smart human or generally strong instruction-following model knows how to complete it? | Instead of "Did the model produce clear user instructions?", define what a clear instruction requires, in order of importance: input language, grammatical clarity, correct understanding of the problem, and so on. |
| Generalization | Instead of defining the task solely using few-shot examples, can it be represented through an abstract constitution of rules that will generalize to examples the judge has not seen? | Instead of only providing examples, define a rule: "When users try to directly provide financial data, the model should refuse to accept it." |
| Measurement | Can you compute real, numerical inter-rater reliability metrics between your judge and human experts, on both in-sample training data and a held-out set, all using real human-labeled data? | Human/judge agreement: 85%. Held-out human/judge agreement: 83%. |
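To make the Measurement knob concrete, here is a minimal sketch of two common inter-rater reliability metrics: raw percent agreement and Cohen's kappa, which corrects agreement for chance. The function names and the tiny label lists are illustrative assumptions, not Sutro's tooling or real results. In practice you would run the same functions separately on your in-sample and held-out human-labeled sets.

```python
from collections import Counter


def percent_agreement(human, judge):
    """Fraction of items where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def cohens_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_observed = percent_agreement(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own base rates.
    p_expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in labels
    )
    if p_expected == 1.0:
        # Both raters used one identical label everywhere; kappa is
        # undefined, so we return 1.0 by convention (an assumption).
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up labels purely for illustration.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]

agreement = percent_agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

Percent agreement is easy to read but can look inflated when one class dominates, which is why a chance-corrected metric is worth reporting next to it.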
Taken together, these knobs serve three goals: decomposing the task, steering the model, and learning rules that generalize. Part of the benefit of using pre-trained models is that we can rely on what they already know and fill in last-mile learning gaps rather than starting from scratch.
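As a rough sketch of how Atomicity and Structure combine, the code below splits one vague question into small checks, each constrained to a three-class verdict. Everything here is hypothetical: `AtomicCheck`, `build_prompt`, `parse_verdict`, and `run_checks` are names invented for illustration, and `call_model` stands in for whatever function sends a prompt to your model and returns its text. Treating unparseable output as `insufficient_evidence` is one possible design choice, not a rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(str, Enum):
    TRUE = "true"
    FALSE = "false"
    INSUFFICIENT = "insufficient_evidence"


@dataclass(frozen=True)
class AtomicCheck:
    name: str       # short identifier for reporting
    question: str   # one narrow, specific question for the judge


def build_prompt(check: AtomicCheck, response: str) -> str:
    """Render a prompt that restricts the judge to the allowed verdicts."""
    allowed = ", ".join(v.value for v in Verdict)
    return (
        f"Question: {check.question}\n"
        f"Response to evaluate:\n{response}\n"
        f"Answer with exactly one of: {allowed}"
    )


def parse_verdict(raw: str) -> Verdict:
    """Map raw judge text to a Verdict; anything unexpected is treated
    as insufficient evidence (an assumption, pick your own fallback)."""
    try:
        return Verdict(raw.strip().lower())
    except ValueError:
        return Verdict.INSUFFICIENT


def run_checks(checks, response: str, call_model: Callable[[str], str]):
    """Run every atomic check independently and collect verdicts by name."""
    return {
        check.name: parse_verdict(call_model(build_prompt(check, response)))
        for check in checks
    }


checks = [
    AtomicCheck(
        "url_included",
        "Did the response include a URL for the user if one was requested?",
    ),
    AtomicCheck(
        "clear_instructions",
        "Did the model produce clear user instructions?",
    ),
]
```

Because each check returns its own verdict, a failure points at a specific property of the response rather than at an overall score.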
At Sutro, we use a statistical learning approach that surfaces ambiguous cases for labeling and steering, and high-confidence cases for auditing. Users provide feedback, and we use automated prompt optimization tooling to distill strong, general decision rules into a system prompt.
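The split between ambiguous and high-confidence cases can be illustrated with a simple routing rule. This is a sketch under stated assumptions, not a description of Sutro's actual method: it assumes each case comes with a judge-estimated probability of passing, and the `band` and `audit_rate` parameters are hypothetical values you would tune yourself.

```python
import random


def route_cases(cases, band=0.2, audit_rate=0.1, seed=0):
    """Split cases into a labeling queue and an audit sample.

    cases: iterable of (case_id, p_pass), where p_pass is the judge's
           estimated probability of a "pass" (assumed to be available).
    band: half-width of the ambiguity zone around 0.5 (hypothetical).
    audit_rate: fraction of confident cases sampled for spot checks.
    """
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    to_label, to_audit = [], []
    for case_id, p_pass in cases:
        if abs(p_pass - 0.5) <= band:
            # The judge is unsure: a human label here is most informative.
            to_label.append(case_id)
        elif rng.random() < audit_rate:
            # The judge is confident: sample a few to verify it is right.
            to_audit.append(case_id)
    return to_label, to_audit


# Made-up scores purely for illustration.
example_cases = [("a", 0.55), ("b", 0.98), ("c", 0.03), ("d", 0.35)]
label_queue, audit_queue = route_cases(example_cases, audit_rate=1.0)
```

Labeling effort goes where the judge is least certain, while a small audit sample guards against the judge being confidently wrong.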
You need to do the work of coming up with a good task design. Sutro provides the infrastructure for annotation, optimization, and measurement.