Just read the f***ing data
In working with customers deploying AI systems, one thing is often clear: nearly all model behavior problems are actually inference-time data problems.
If a smart model consistently fails at a certain task, it's typically not because the model was trained poorly, but because you are supplying bad or missing instructions or data as its context.
Therefore, the first reflex for improving task performance should be to tear into a representative cut of the underlying data the task is being run on.
Many teams go overboard at the outset: buying observability products, setting up automated monitoring tools, or integrating off-the-shelf eval products. A more reasonable first step is to find a way to get model inputs and outputs into an interface where you and/or a domain expert can review them. Manually reading over just a handful of results will often provide a massive diagnostic lift and help you understand where scaled approaches are worth using.
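As a rough illustration of how little tooling this takes, here is a minimal sketch that samples logged model calls and turns them into a spreadsheet-ready CSV with empty columns for a reviewer to fill in. It assumes your logs are JSON Lines with `input` and `output` keys; those field names, the `verdict` and `notes` columns, and the `build_review_sheet` helper are all hypothetical and should be adapted to whatever your system actually records.

```python
import csv
import io
import json
import random


def build_review_sheet(jsonl_text, sample_size=20, seed=0):
    """Sample logged model calls and return CSV text with blank annotation columns.

    Assumes each non-empty line of jsonl_text is a JSON object with
    "input" and "output" keys (hypothetical field names).
    """
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

    # A fixed seed keeps the sample reproducible between review sessions.
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["input", "output", "verdict", "notes"])
    writer.writeheader()
    for record in sample:
        writer.writerow(
            {
                "input": record["input"],
                "output": record["output"],
                "verdict": "",  # left empty for the reviewer
                "notes": "",    # left empty for the reviewer
            }
        )
    return buffer.getvalue()


if __name__ == "__main__":
    example_log = "\n".join(
        json.dumps({"input": f"question {i}", "output": f"answer {i}"}) for i in range(5)
    )
    print(build_review_sheet(example_log, sample_size=3))
```

The resulting text can be saved as a `.csv` file and opened in any spreadsheet tool, which is often enough of an "interface" for a first pass.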
Expert-in-the-loop (EITL)
Oftentimes, the developer of an AI product is not a domain expert in the task that the product aims to augment or automate. Before you start annotating, it's important to do an honest read of the situation: are you the one whose judgement the model should be using? If you are not the expert, designate someone who is.
One, or multiple experts?
It may be helpful to have multiple experts in some cases, but it's usually simpler to produce a single expert annotation per case being reviewed. Otherwise you'll be forced to add post-processing logic to adjudicate disagreements between experts.
So even if multiple voices are in the room, it's best to consolidate their opinions into a single annotation.
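To make the cost of multiple annotations concrete, here is a minimal sketch of the kind of adjudication logic you would otherwise need. It uses a simple majority vote and flags ties for manual resolution; the `adjudicate` helper and the label values are hypothetical, and majority voting is only one of several possible schemes.

```python
from collections import Counter


def adjudicate(labels):
    """Return the majority label, or None when there is no clear winner.

    None signals that a human still has to resolve the case, which is the
    extra step a single consolidated annotation avoids.
    """
    counts = Counter(labels).most_common()
    if not counts:
        return None  # nobody annotated this case
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # top two labels are tied
    return counts[0][0]


if __name__ == "__main__":
    print(adjudicate(["pass", "pass", "fail"]))
    print(adjudicate(["pass", "fail"]))
```

Even this small helper leaves open questions, such as who breaks ties and whether all experts should carry equal weight, which is why consolidating opinions up front tends to be simpler.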
Are expert annotations just static evals?
No. Static evals are individual test cases, or expected output contracts with a model. They're the passive, defensive cousin of expert annotations.
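For contrast, a static eval in this sense can be sketched as a fixed list of test cases, each pairing an input with a check on the output. Everything here is illustrative: `run_static_evals`, the case format, and the stub `fake_model` are hypothetical stand-ins, not part of any particular eval product.

```python
def run_static_evals(model_fn, cases):
    """Run fixed (name, prompt, check) cases and return the names that fail.

    model_fn is any callable mapping a prompt string to an output string.
    Each check is a callable that returns True when the output is acceptable.
    """
    failures = []
    for name, prompt, check in cases:
        if not check(model_fn(prompt)):
            failures.append(name)
    return failures


if __name__ == "__main__":
    # Stub standing in for a real model call (assumption for illustration).
    def fake_model(prompt):
        return prompt.upper()

    cases = [
        ("uppercases input", "hello", lambda out: out == "HELLO"),
        ("non-empty output", "hi", lambda out: len(out) > 0),
    ]
    print(run_static_evals(fake_model, cases))
```

A suite like this tells you when a known contract breaks, but it does not capture why an expert preferred one output over another.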
Sutro believes in using expert annotations as the primary learning signal from which model behavior should be derived. In the literature, the closest analogue is reinforcement learning from human feedback (RLHF), one of the primary ways in which foundation model providers align behavior in the first place. But while the foundation model providers use RLHF to shape general model behavior, your goal in collecting annotations is to capture the last-mile subjective learnings a model needs to complete a task the way an expert would.