> Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention.
I have yet to see this (produce anything actually useful).
I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.
I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.
No, not for days - but it churned away on that one for about ten minutes.
I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/
Maybe so, but I did once spend 12 hours straight debugging an Emscripten C++ compiler bug! (After spending the first day of the jam setting up Emscripten, and the second day getting Raylib to compile in it. Had like an hour left to make the actual game, hahah.)
I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)
I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.
(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)
How do you deal with the cost associated with a long-running Opus session? I asked it to validate some JSON configs against the spec yesterday and it burned $10 worth of tokens for what would have been a 1 millisecond linter task.
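For comparison, here is a rough sketch of what the deterministic version of that kind of check might look like. The spec is invented for illustration (`REQUIRED_FIELDS` and its keys are assumptions, not the actual spec from the comment above), and a real setup would more likely use a proper schema validator.

```python
import json

# Hypothetical spec: required key -> expected Python type (assumed for illustration).
REQUIRED_FIELDS = {"name": str, "port": int, "debug": bool}


def validate_config(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config passes."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top-level value must be an object"]
    problems = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        # Exact type match, so a bool is not silently accepted where an int is required.
        elif type(config[key]) is not expected:
            actual = type(config[key]).__name__
            problems.append(f"{key}: expected {expected.__name__}, got {actual}")
    return problems
```

Something like this runs in well under a second and costs nothing per run, so it can also serve as the check an agent calls instead of reasoning about each file itself.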
If you look through the commit logs on simonw/research and simonw/tools on GitHub most commits should either list the prompt, link to a PR with the prompt or link to a session transcript.
I routinely leave Codex running for a few hours overnight to debug stuff.
If you have a deterministic unit test that can reproduce the bug through your app's front door, but you have no idea how the bug is actually happening, having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc. is an ideal use case.
I have a hard time understanding how that would work. I typically interface with coding agents through Cursor. The flow is like this: ask it something -> it works for a min or two -> I have to verify and fix by asking it again, and so on until we're at a happy place with the code. How do you stop it from going down a bad path and never pulling itself out of it?
The important role for me, as a SWE, in the process, is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?
Or is it more like your use case, where you can say "here's a failing test, do whatever you can to fix it and don't stop until you do"? I could see that limited case working.
For some reason setting up agents in a loop with a solid prompt and new context each iteration seems to result in higher quality work for larger or more difficult tasks than the chat interface. It's like the agent doesn't have to spend half its time trying to guess what you want
It's constantly restarting itself, looking at the current state of things, re-reading what the request was and what it did and failed at in the past (at a higher level), and trying again and again.
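A minimal sketch of that kind of loop, under stated assumptions: `run_agent` and `is_done` are hypothetical callables standing in for whatever agent CLI and deterministic check you actually use, not a real API. The point is only that each iteration gets a prompt rebuilt from the original request plus short notes, rather than a growing chat history.

```python
from typing import Callable


def run_fresh_context_loop(
    task: str,
    run_agent: Callable[[str], str],  # hypothetical: takes a prompt, returns a short summary
    is_done: Callable[[], bool],      # hypothetical: deterministic check, e.g. tests pass
    max_iterations: int = 10,
) -> list[str]:
    """Restart the agent with a fresh prompt each iteration until is_done() passes."""
    notes: list[str] = []  # high-level history carried between iterations
    for i in range(max_iterations):
        if is_done():
            break
        # Rebuild the prompt from scratch: the original request plus prior notes only.
        history = "\n".join(f"- {n}" for n in notes) or "- (nothing yet)"
        prompt = f"Task: {task}\nPrevious attempts:\n{history}"
        notes.append(f"attempt {i + 1}: {run_agent(prompt)}")
    return notes
```

The iteration cap is there so a run that never converges stops on its own instead of burning tokens all night.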
I don't even necessarily ask it to fix the bug — just identify the bug
Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.
I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)
it's more like "this function is crashing with an inconsistent file format error. can you figure out how a file with the wrong format got this far into the pipeline?". in cases like that the fix is usually pretty easy once you have the one code path out of several thousands nailed down.
Or, they have freed up time for more useful endeavours that might otherwise have been spent on drudgery.
I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.
It's easy to say that these increasingly popular tools can only produce useless junk. Either you haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are following the feeds of incompetent users.
I'm definitely bullish on LLMs for coding. It sounds to me as though getting one to run on its own for hours and produce something usable requires more careful thought and setup than just throwing a prompt at it and wishing for the best, but I haven't seen many examples in the wild yet.
Strategy -> [ Plan -> [Execute -> FastVerify -> SlowVerify] -> Benchmark -> Learn lessons] -> back to strategy for next big step.
Claude teams and a Ralph Wiggum loop can do it, or really any reasonable agent. But usually it all falls apart on brittle Verify or Benchmark steps. What is important is to record positive lessons in a store that survives git resets, machine blowups, etc. Any Telegram bot channel will do :)
The entire setup is usually a pain: Docker for verification, Docker for benchmarks, etc. You need the ability to run the thing quickly, the ability for the loop itself to add things, and the ability to do this in worktrees simultaneously for faster exploration. And god help you if you need hardware to do this. For example, such a loop is used to tune and custom-fuse CUDA kernels, which means a model evaluator, a big box, etc.
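Roughly, the inner part of the loop above (Plan -> [Execute -> FastVerify -> SlowVerify] -> Benchmark -> Learn) could be sketched like this. Every callable here is a hypothetical placeholder for your own Docker or benchmark harness, and the JSON lessons file is just one assumed stand-in for "a store that survives git resets".

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_step(
    plan: str,
    execute: Callable[[str], None],   # hypothetical: lets the agent act on the plan
    fast_verify: Callable[[], bool],  # hypothetical: cheap check (lint, unit tests)
    slow_verify: Callable[[], bool],  # hypothetical: expensive check (full suite)
    benchmark: Callable[[], float],   # hypothetical: returns a score for this attempt
    lessons_path: Path,               # meant to live outside the git working tree
    max_attempts: int = 3,
) -> Optional[float]:
    """One pass of Plan -> [Execute -> FastVerify -> SlowVerify] -> Benchmark -> Learn."""
    lessons = json.loads(lessons_path.read_text()) if lessons_path.exists() else []
    score = None
    for attempt in range(1, max_attempts + 1):
        execute(plan)
        # Cheap check first; the slow check only runs when the fast one passes.
        if fast_verify() and slow_verify():
            score = benchmark()
            lessons.append({"plan": plan, "attempt": attempt, "score": score})
            break
        lessons.append({"plan": plan, "attempt": attempt, "score": None})
    # Persist the outcome of every attempt so a reset of the repo does not erase it.
    lessons_path.write_text(json.dumps(lessons, indent=2))
    return score
```

The outer Strategy loop would call this once per plan and read the lessons file back in before choosing the next big step.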
I am currently porting pyte to Go through a similar approach (feeding the LLM with a core SPEC and two VT100/VT220 test suites). It's chugging along quite nicely.
Anthropic is actually sort of concerned with not burning through cash and charging people a reasonable price. OpenAI doesn't care. I can use Codex CLI all day and not approach any quotas with just my $20 a month ChatGPT subscription.
I treat coding agents like junior developers and never take my hand off the wheel except for boilerplate refactoring.
The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked on this for over 3 hours without intervention (I went to sleep). This is now in production.
(but honestly for a lot of websites and web apps you really can just send it, the stakes are very low for a lot of what most people do, if they're honest with themselves)
I find this absolutely wild. In my experience Codex code quality is still not as good as a human's, so letting Codex do something and not verifying or cleaning up behind it will most likely result in lower code quality and possibly subtle bugs.
For upgrading frameworks and such there are usually not that many architectural decisions to be made, where you care about how exactly something is implemented. Here the OP could probably verify the build works, with all the expected artifacts quite easily.
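As a sketch of what "verify the build works, with all the expected artifacts" might mean in practice: the artifact names you pass in are whatever your own build is supposed to produce (nothing here is Vite-specific, and the function name is made up for illustration).

```python
from pathlib import Path


def missing_artifacts(build_dir: Path, expected: list[str]) -> list[str]:
    """Return the expected artifact paths that are absent or empty under build_dir."""
    missing = []
    for rel in expected:
        path = build_dir / rel
        # Treat zero-byte files as missing: an empty bundle usually means a broken build.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

An empty result is only a smoke test that the build produced output; it says nothing about whether the site behaves correctly.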
Agreed. I optimistically let it resolve merge conflicts in an old, complex branch. It looked fine at first but was utter slop upon further review. Duplication, wildly unnecessary complexity and all.