This matches what I've seen working with automated systems. The watching part is
genuinely underrated. Evals give you a score. Watching gives you intuition about
failure modes you didn't know to test for.
Sitting with a running system teaches you things you would never think to measure.