Experienced engineers that know the codebase and system well, and with enough time to consider the problem properly would likely consider this case.
But if we're vibing... This is the kind of bug that should make it back into a review agent/skill's instructions in a more generic format. Essentially if something is done to the message history, check there tests that subsequent turns work as expected.
But yeah, you'd have to piss off a bunch of users in prod first to discover the blind spot.
I can't remember what the technique is called, but back in the GPT 4 days there was a paper published about having a number of attempts at responding to a prompt and then having a final pass where it picks the best one. I believe this is part of how the "Pro" GPT variant works, and Cursor also supports this in a way (though I'm not sure if the auto pick best one at the end is part of it - never tried)
This is an excellent article. And can I just remind everyone that this is what human authorship looks like? Clearly not LLM generated. It has the author's unique tone, take on the subject, research, clear compelling story... A real breath of fresh air to be honest.
Recently think that Ben's writing is more complex and verbose than ever, but I agree with your point entirely. He is writing it, not AI. I don't listen to his voiceovers but think of the articles as narrated by a captivating in-person presenter/lecturer.
I got a lot of <empty> as well. But was able to get a slide deck out of it before that happened, and it was reasonably good. Not 1-shot good, but better than what I have gotten out of Opus 4.6 with a skill previously
And India. It's a common experience that engineering teams from India will say yes to everything and then do what they think is best. Rather than saying no and explaining what they want to do instead
But if we're vibing... This is the kind of bug that should make it back into a review agent/skill's instructions in a more generic format. Essentially if something is done to the message history, check there tests that subsequent turns work as expected.
But yeah, you'd have to piss off a bunch of users in prod first to discover the blind spot.
reply