> Our latest frontier models have shown particular strengths in their ability to...

simonw · 2026-02-12T18:56:30 1770922590

How hard have you tried?

I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.

I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.

I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79

aeyes · 2026-02-12T19:03:08 1770922988

What do you mean? The generated script just downloads the sources and runs pyodide: https://github.com/simonw/research/blob/main/cysqlite-wasm-w...

There are maybe 5 relevant lines in the script and nothing complex at all that would require it to run for days.

simonw · 2026-02-12T19:29:07 1770924547

No, not for days - but it churned away on that one for about ten minutes.

I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/

AntiRush · 2026-02-13T04:00:11 1770955211

This was a 24 hour task from a single prompt, GPT-5.2

https://tomisin.space/projects/graph-easy-ts/

andai · 2026-02-12T19:30:11 1770924611

Maybe so, but I did once spend 12 hours straight debugging an Emscripten C++ compiler bug! (After spending the first day of the jam setting up Emscripten, and the second day getting Raylib to compile in it. Had like an hour left to make the actual game, hahah.)

I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)

I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.

(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)

citizenpaul · 2026-02-15T19:29:39 1771183779

How do you deal with the cost associated with a long-running Opus session? I asked it to validate some JSON configs against the spec yesterday and it burned $10 worth of tokens for what would have been a 1 millisecond linter task.

simonw · 2026-02-15T20:01:13 1771185673

I'm on the $200/month Claude Max plan and I rarely run out of my token allowance.

I'm also paying $20/month for OpenAI Codex and again it's rare I hit the rate limits there.

basilgohar · 2026-02-12T19:04:08 1770923048

Can you share any examples of these one-shot prompts? I've not gotten to the point where I can get those kinds of results yet.

simonw · 2026-02-12T19:45:04 1770925504

If you look through the commit logs on simonw/research and simonw/tools on GitHub most commits should either list the prompt, link to a PR with the prompt or link to a session transcript.

gamegoblin · 2026-02-12T18:50:05 1770922205

I routinely leave codex running for a few hours overnight to debug stuff

If you have a deterministic unit test that can reproduce the bug through your app front door, but you have no idea how the bug is actually happening, having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc — it's an ideal usecase

nikkwong · 2026-02-12T19:19:25 1770923965

I have a hard time understanding how that would work — for me, I typically interface with coding agents through cursor. The flow is like this: ask it something -> it works for a min or two -> I have to verify and fix by asking it again; etc. until we're at a happy place with the code. How do you get it to stop from going down a bad path and never pulling itself out of it?

The important role for me, as a SWE, in the process, is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?

Or is it more like with your usecase—you can say "here's a failing test—do whatever you can to fix it and don't stop until you do". I could see that limited case working.

woah · 2026-02-12T19:35:51 1770924951

For some reason setting up agents in a loop with a solid prompt and new context each iteration seems to result in higher quality work for larger or more difficult tasks than the chat interface. It's like the agent doesn't have to spend half its time trying to guess what you want

vel0city · 2026-02-12T20:17:53 1770927473

You do things like ralph loops.

https://github.com/snarktank/ralph

It's constantly restarting itself, looking at the current state of things, re-reading what the request was, what it did and failed at in the past (at a higher level), and trying again and again.
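
For anyone who wants to see the shape of that, here is a minimal sketch of such a loop. It is not taken from the linked repo: `run_agent`, `is_done` and the notes file are hypothetical stand-ins for whatever agent CLI, completion check and progress log you actually use.

```python
from pathlib import Path
from typing import Callable


def ralph_loop(
    request: str,
    run_agent: Callable[[str], str],
    is_done: Callable[[], bool],
    notes_path: Path,
    max_iterations: int = 10,
) -> int:
    """Restart the agent with fresh context until is_done() or the budget runs out.

    Returns the number of iterations actually run.
    """
    for iteration in range(1, max_iterations + 1):
        if is_done():
            return iteration - 1
        # Fresh context each time: only the original request and the
        # high-level notes from earlier attempts are carried over.
        notes = notes_path.read_text() if notes_path.exists() else "(none yet)"
        prompt = (
            f"Request:\n{request}\n\n"
            f"Notes from previous attempts:\n{notes}\n\n"
            "Inspect the current state of the repo, then continue."
        )
        summary = run_agent(prompt)
        # Append a short summary so the next iteration knows what was tried.
        with notes_path.open("a") as f:
            f.write(f"Iteration {iteration}: {summary}\n")
    return max_iterations
```

In practice `run_agent` would shell out to an agent and `is_done` would run the test suite or check a task list; both are left as callables here so the loop itself stays easy to test.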

gamegoblin · 2026-02-12T22:23:11 1770934991

I use Codex CLI or Claude Code

I don't even necessarily ask it to fix the bug — just identify the bug

Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.

p1esk · 2026-02-12T19:45:07 1770925507

“here's a failing test—do whatever you can to fix it”

Bad idea. It can modify the code so that the test passes but everything else is now broken.
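
One way to guard against that failure mode is to make "the whole suite still passes" part of the acceptance criterion, not just the one failing test. A rough sketch, assuming you can collect per-test pass/fail results from a full run before and after the agent's change (`evaluate_fix` and the result dictionaries are hypothetical names, not part of any tool mentioned here):

```python
from typing import Dict, List, Tuple


def evaluate_fix(
    target_test: str,
    before: Dict[str, bool],
    after: Dict[str, bool],
) -> Tuple[bool, List[str]]:
    """Accept a fix only if the target test passes and nothing regressed.

    `before` and `after` map test names to pass/fail results from full
    suite runs taken before and after the agent's change.
    """
    # Tests that passed before but fail (or disappeared) afterwards.
    regressions = sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )
    target_fixed = after.get(target_test, False)
    return target_fixed and not regressions, regressions
```

A deleted test counts as a regression here, which also catches the case where the agent "fixes" things by removing tests.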

SatvikBeri · 2026-02-12T22:36:01 1770935761

I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)

zem · 2026-02-12T21:41:40 1770932500

it's more like "this function is crashing with an inconsistent file format error. can you figure out how a file with the wrong format got this far into the pipeline?". in cases like that the fix is usually pretty easy once you have the one code path out of several thousands nailed down.

tsss · 2026-02-12T19:01:40 1770922900

How can you afford that?

wahnfrieden · 2026-02-12T19:03:37 1770923017

It costs $200 for a month

addaon · 2026-02-12T19:23:00 1770924180

> it's an ideal usecase

This is impressive, you’ve completely mitigated the risk of learning or understanding.

arcanemachiner · 2026-02-12T19:27:35 1770924455

Or, they have freed up time for more useful endeavours, that may otherwise have been spent on drudgery.

I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.

wahnfrieden · 2026-02-12T19:01:29 1770922889

It worked for me several times.

It's easy to say that these increasingly popular tools are only able to produce useless junk. You haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are monitoring incompetent feeds of other users.

nikkwong · 2026-02-12T19:25:48 1770924348

I'm definitely bullish on LLMs for coding. It sounds to me as though getting it to run on its own for hours and produce something usable requires more careful thought and setup than just throwing a prompt at it and wishing for the best, but I haven't seen many examples in the wild yet

foobar10000 · 2026-02-12T20:16:10 1770927370

It needs a closed loop.

Strategy -> [ Plan -> [Execute -> FastVerify -> SlowVerify] -> Benchmark -> Learn lessons] -> back to strategy for next big step.

Claude teams and a Ralph Wiggum loop can do it - or really any reasonable agent. But usually it all falls apart on either brittle Verify or Benchmark steps. What is important is to learn positive lessons into a store that survives git resets, machine blowups, etc… Any Telegram bot channel will do :)

The entire setup is usually a pain to set up - Docker for verification, Docker for benchmark, etc… Ability to run the thing quickly, ability for the loop itself to add things, ability to do this in worktrees simultaneously for faster exploration - and god help you if you need hardware to do this - for example, such a loop is used to tune and custom-fuse CUDA kernels - which means a model evaluator, big box, etc…
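
A stripped-down sketch of the inner part of that loop (Plan -> Execute -> FastVerify -> SlowVerify -> Benchmark -> Learn) might look like the following. All of the callables are hypothetical placeholders, "higher score is better" is an assumption, and the lessons store is just a JSON file that you would keep outside the git worktree, standing in for the Telegram channel idea.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional


def closed_loop(
    plans: List[str],
    execute: Callable[[str, List[str]], str],
    fast_verify: Callable[[str], bool],
    slow_verify: Callable[[str], bool],
    benchmark: Callable[[str], float],
    lessons_path: Path,
) -> Optional[str]:
    """Try each plan; keep the best verified candidate and persist lessons."""
    lessons = json.loads(lessons_path.read_text()) if lessons_path.exists() else []
    best, best_score = None, float("-inf")
    for plan in plans:
        candidate = execute(plan, lessons)
        # Cheap check first so the expensive one only runs on plausible work.
        if not fast_verify(candidate):
            lessons.append(f"{plan}: failed fast verify")
        elif not slow_verify(candidate):
            lessons.append(f"{plan}: failed slow verify")
        else:
            score = benchmark(candidate)
            lessons.append(f"{plan}: verified, score {score}")
            if score > best_score:
                best, best_score = candidate, score
        # Write after every step so lessons survive a crash or a git reset.
        lessons_path.write_text(json.dumps(lessons, indent=2))
    return best
```

The outer Strategy step would choose the next batch of `plans` based on what ended up in the lessons file; that part is left out here.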

wahnfrieden · 2026-02-12T21:00:35 1770930035

I do it easily just by asking Codex

rcarmo · 2026-02-12T20:27:42 1770928062

well, you can start with https://github.com/rcarmo/go-textile, https://github.com/rcarmo/go-rdp, https://github.com/rcarmo/go-ooxml, https://github.com/rcarmo/go-busybox (still WIP). All of these are essentially SPEC and test-driven and they are all working for me (save a couple of bugs in go-rdp I need to fix myself, and some gaps in the ECMA specs for go-ooxml that require me to provide actual manually created documents for further testing).

I am currently porting pyte to Go through a similar approach (feeding the LLM with a core SPEC and two VT100/VT220 test suites). It's chugging along quite nicely.

XCSme · 2026-02-12T18:44:38 1770921878

Their ability to burn through tokens non-stop for hours, days or weeks without intervention.

raw_anon_1111 · 2026-02-12T19:13:49 1770923629

You’re mixing up OpenAI with Anthropic.

Anthropic is actually sort of concerned with not burning through cash and charging people a reasonable price. OpenAI doesn’t care. I can use Codex CLI all day and not approach any quotas with just my $20 a month ChatGPT subscription.

I treat coding agents like junior developers and never take my hand off the wheel except for boilerplate refactoring.

TheMuenster · 2026-02-12T21:20:52 1770931252

Can I just say how funny this metric is?

"Our model is so slow and our tokens/second is so low that these tasks can take hours!" is not the advertising they think it is.

johnfn · 2026-02-12T19:28:55 1770924535

The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked on this for over 3 hours without intervention (I went to sleep). This is now in production.

seunosewa · 2026-02-12T20:16:54 1770927414

How did you verify it?

girvo · 2026-02-12T21:45:16 1770932716

Just send it bro

(but honestly for a lot of websites and web apps you really can just send it, the stakes are very low for a lot of what most people do, if they're honest with themselves)

johnfn · 2026-02-13T02:01:51 1770948111

Uhhh, this is my work, so… we didn’t have a SEV? None of our thousands of customers paying us money reported the site was broken?

ghosty141 · 2026-02-13T10:07:36 1770977256

I find this absolutely wild. From my experience Codex code quality is still not as good as a human's, so letting Codex do something and not verifying / cleaning up behind it will most likely result in lower code quality and possibly subtle bugs.

tinodb · 2026-02-14T12:37:11 1771072631

For upgrading frameworks and such there are usually not that many architectural decisions to be made, where you care about how exactly something is implemented. Here the OP could probably verify the build works, with all the expected artifacts quite easily.

mikojan · 2026-02-13T10:01:40 1770976900

Agreed. I optimistically let it resolve merge conflicts in an old complex branch. Looked fine at first but was utter slop upon further review. Duplication, wildly unnecessary complexity and all.

bitwize · 2026-02-12T19:38:26 1770925106

PEBKAC