Their thesis is that even when the eval is useless for correctness of a single agentic action in production, it allows you to choose between two agents by cross-comparing in a large aggregated collection of tasks. Effectively: you can tune your agentic parameters.
Nothing new to the idea that taking many samples and averaging can work when a single datapoint doesn’t. Presumably this is part of a conversation in which we’re lacking context.