Evals as Outer Loop | Sutro Handbook

Test-driven development is a long-standing pattern for building reliable software. The premise is simple: first write tests that guarantee system reliability, then build the software that passes the test suite. Building the tests around the system can be thought of as the outer loop, and the architecture and implementation of the system itself as the inner loop.

Eval-driven development?

AI engineering is not that different; the only caveat is that you're not seeking guarantees. That is the tradeoff we make when reaching for non-deterministic systems. So while the idea of eval-driven development has been proposed, we do not wholeheartedly endorse it quite yet.

Why? Because foundation models come with some degree of reliability out of the box, perhaps even enough to ship in a low-stakes setting. Evals should therefore fill in the remaining gaps, not necessarily define initial behavior. That said, creating evals to identify model failures is extremely important, so we highly encourage incorporating evals early in the development process.

Why should I care about evals?

In our opinion, evals serve two purposes:

Create a rubric and measurement system around which you can improve an AI system.
Use that measurement system to improve the AI system during initial development and after production deployment.

There's an implicit, uncomfortable truth buried in there. Can you spot it?

That truth: nearly all AI systems have no real ground truth.

Evals should answer questions like:

Is the system reliably doing the task it was built to do?
What are the common failure modes within my control?
Are changes to the AI system, such as swapping models, changing prompts, or introducing or removing context, improving or hurting performance?
Is the system reliable enough for the workflow, user, and risk level it serves?
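
To make the third question concrete, here is a minimal sketch of scoring two variants of a system against the same rubric and reporting the difference. Everything in it is an assumption for illustration: `EvalCase`, `pass_rate`, and `compare` are hypothetical names, and the two "systems" are toy string functions standing in for real model calls. It is not a prescribed harness, only one way the measurement could look.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One rubric item: an input plus a check on the system's output (hypothetical shape)."""
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def pass_rate(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose output satisfies its rubric check."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.check(system(case.prompt)))
    return passed / len(cases)


def compare(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    cases: list[EvalCase],
) -> dict[str, float]:
    """Score both variants on the same cases and report the change."""
    base = pass_rate(baseline, cases)
    cand = pass_rate(candidate, cases)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Toy stand-ins for two versions of an AI system (assumption: in practice
# these would wrap a model call with a given prompt, model, or context).
def baseline_system(prompt: str) -> str:
    return prompt.upper()


def candidate_system(prompt: str) -> str:
    return prompt.strip().upper()


CASES = [
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase(" hi ", lambda out: out == "HI"),
    EvalCase("ok\n", lambda out: out == "OK"),
    EvalCase("yes", lambda out: out == "YES"),
]

report = compare(baseline_system, candidate_system, CASES)
print(report)  # {'baseline': 0.5, 'candidate': 1.0, 'delta': 0.5}
```

Because there is usually no ground truth, the `check` functions are where the rubric lives. With real models, outputs vary from run to run, so a single delta on a handful of cases should be read as a signal to investigate, not a verdict.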