
What do you mean, my prompts specifically ask for a PhD-level expert in every field?

\s


"Expertise" is a completely different beast from "knowledge".

Expecting to gain it from a model through prompting alone is like expecting to become capable at something just because you bought a book on the topic.


This was sarcasm, sorry if that wasn’t clear.

It does matter, because the LLM doesn’t always know when to use tools (e.g. ask it for sales projections that resemble something in its weights and it may answer from memory rather than calculate) and it is unable to reason about the boundaries of its knowledge.

To those saying this is not surprising: it will be surprising to the general public, who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them, and so on.

This is important information for anyone who thinks these systems are thinking, reasoning, and learning, or that they’re having a conversation with them, i.e. 90% of LLM users.


> saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them etc etc.

Why do you think the results of this paper contradict these claims at all?


A machine which confabulates and cannot count is not a good fit for accounting tasks. It will make all sorts of subtle errors that are difficult for humans to notice.

That wouldn't necessarily be true even if models really "couldn't count", since software exists: if an LLM builds an Excel spreadsheet rather than doing everything manually, it's both much harder for it to mess up and easier to notice and recover. It's even less true given that what this paper actually tests is "LLMs don't have literally perfect accuracy when you make them do increasingly big problems with zero thinking".

(Confabulation is IMO a much bigger problem, but it's unrelated to architecture - it's an artifact of how models are currently trained.)


> general public

and the C-suite


Quick sanity check: you're susceptible to pretty irresistible optical illusions that would never fool a VLM; does that mean you're not thinking? In fact, with a non-monospaced font I also have trouble determining whether these parens are balanced, and have to select them with the mouse, i.e. use a "dumb" tool, to make sure.
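
The "dumb tool" for that check is trivial, which is rather the point; a sketch (TypeScript, purely for illustration):

    // Parens are balanced iff the running depth never goes negative
    // and ends at zero.
    function parensBalanced(s: string): boolean {
      let depth = 0;
      for (const ch of s) {
        if (ch === "(") depth++;
        else if (ch === ")" && --depth < 0) return false; // closed before opened
      }
      return depth === 0;
    }

No intelligence involved, and yet it beats eyeballing every time.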

Reminder that "thinking" is an ill-defined term like many others, and the question of whether they "think" is basically irrelevant. No intelligent system, human or machine, will ever have a zero error rate, due to the very nature of intelligence (another vague term). You have to deal with that the same way you deal with it in humans: either treat bugs as bugs and build systems resilient to them, or accept the baseline error rate if it's low enough.


Who is hiring anyone to look at a screen and count characters? Don't be disingenuous in your argument. The apt comparison would be the current technique used to accomplish this task, i.e. a pattern-matching algorithm.

Can you expand on that? A separate honeypot sign-up page invisible to real users, or something else?

You add "hidden" inputs to your HTML form that are named like "First Name" or "Family Name". Bots will fill them out. You will either expect them to be empty or you fill by JavaScript with sth you expect. It's of course reverse-engineerable, but does the trick.

Doesn't that break password manager autofill?

Watch out, it may break the accessibility of your service. If somebody fills in these fields, I would add extra verification, e.g. an accessible CAPTCHA.

Thanks, though I’ve seen scripted attacks bypass this sort of hidden input unfortunately (perhaps human-assisted, or perhaps just ignoring hidden fields).

They often do ignore truly hidden fields (input type=hidden), but if you put them "behind" another element with CSS, or render them extremely small but still present, many get caught. It's similar to the cheeky prompt-injection attacks people did/do against LLMs.

Thanks.

Sure, it's really basic of course.

Do you test this against password managers? It seems like this approach could generate false positives.

It does read as if it were written on a phone, but it doesn’t read like LLM text to me.

What is interesting, and has possibly bled over from the author’s heavy LLM use, is the style of simplistic bullet-point titles for the argument with filler in between. It does read like they wrote the five bullet points and then added the other text (by hand).


Points from the article.

1. The code is garbage and this means the end of software.

Now try maintaining it.

2. Code doesn’t matter (the same point restated).

No, we shouldn’t accept garbage code that breaks e.g. login as an acceptable cost of business.

3. It’s about product market fit.

OK, but what happens after product market fit when your code is hot garbage that nobody understands?

4. Anthropic can’t defend the copyright of their leaked code.

This I agree with, and they are hoist by their own petard. Would anyone want the garbage, though?

5. This leak doesn’t matter.

I agree with the author, but for different reasons - the value is the models, which are incredibly expensive to train, not the badly written scaffolding surrounding them.

We also should not mistake current market value for use value.

Unlike the author, who seems to have fully signed up for the LLM hype train, I don’t see this as meaning code is dead. It’s an illustration of where fully relying on generative AI will take you: a garbage, unmaintainable mess which must be a nightmare for humans or LLMs to work with.


Hey there, post author here. I think if you read deeper into my blog post history you’ll see that I have a reasonably balanced take on AI.

I generally think this will be a very important technology, so I teach the subject to make sure people understand how to use it as leverage in their lives. (Yes, as paid workshops, but I also volunteer weekly for 3-4 hour sessions at a non-profit, where I get nothing more than the joy of helping people learn a valuable skill.)

At the same time, just last week I wrote a post decrying the slop people are foisting on their coworkers[^1], because I want people to use this technology in a positive way to create the lives they want, not to create downstream consequences for others. Ultimately I think agentic systems are incredibly powerful, but also a technology that lends itself to anti-social behavior because of how independently empowering it can be. And so I hope that with the right exposure, discussion, and teaching we can take advantage of its democratizing nature, while reinforcing that what makes us special as humans is that we care and coordinate to do things of greater value in this world, not just in the financial sense that we often boil it down to when we talk about this subject.

Hope that context provides a better lens into the piece. I still care a lot about code and everything else that got me here, but you are also reading the personal reflections of who I am in a time of change, one that is making me question (or reinforcing) some of the fundamental things I believed about software, and sometimes the world more widely.

[^1]: https://build.ms/2026/3/23/workslop/


I disagree on agentic systems and LLM-based AI, and I don’t feel this is close to a balanced take. You are assuming you are in the middle of a revolution when that is far from obvious. It is not yet clear that this is an important technology, because there are very significant limitations.

The code from one of the leading companies in the space is a good example of where the reality of what is achieved falls far short of expectations.

This is what I meant by the hype train.


Sorry, when I say I think I have a balanced take on AI, what I mean is that I do my best to weigh both the pros and cons of this technology, as opposed to more extreme behavior like spending all day chatting with LLMs, or posting all day on X about how AI is already better than me at everything and jobs are over.

If I had to assign a confidence score to whether agents will change the way we all work and many aspects of how we live, I would put it at 7/10, maybe 8/10. I felt about the same about the smartphone. While many things we do look the same as they did in 2005 (we still drive on roads, kids still go to school), it's undeniable that much of our lives is now intermediated through a small screen, and many societal dynamics have shifted because that technology exists.

I will concede that you should read my post with that context and draw your own conclusions about the validity of my perspective, but I think it is better reasoned than what people generally attribute to "LLM hype". (Of course it's a bit circular that I believe that, but I try to surround myself with people of all kinds, technical and non-technical, and like to think I stay reasonably grounded.)

All that said, I think the code from a leading company being bad and yet delivering good results is more a sign of the technology's jagged frontier[^1]. Calculators can't write sonnets the same way LLMs are bad at math, but that doesn't make them useless; it just makes them a tool. This is a tool in our tool belt, and I find it surprisingly useful as a general-purpose technology despite its limitations. (This relates to the main argument I make in the post: bad code leading to good results may imply that we're under- and overweighting certain aspects of what is important in software development, and that our expectations of code may need to be recalibrated often as we gather more evidence.)

[^1]: https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the...


I'm a little disappointed that a bunch of engineers with unlimited access to Opus didn't do a better job.

What exactly makes you say that "the author who seems to have fully signed up for the LLM hype train"?

I feel the author is just stating the obvious: code quality has very little to do with whether a product succeeds.


> OK, but what happens after product market fit when your code is hot garbage that nobody understands?

This is a different question, but obviously code that "nobody understands" is a terrible situation.

In practice (e.g. in Go) it’s actually pretty good, and infinitely preferable to third-party everything.

LLMs are so, so far away from being able to work independently on a large codebase, and why would they not benefit from modularity and clarity too?

I agree the functions in a file should probably be reasonably sized.

It's also interesting to note that due to the way round-tripping tool-calls work, splitting code up into multiple files is counter-productive. You're better off with a single large file.


> due to the way round-tripping tool-calls work, splitting code up into multiple files is counter-productive.

Can you expand on that?


> independently work on a large codebase

I'm not sure that humans are great at this either. Think about how we use frameworks and have complex supply chains... we sort of get "good enough" at what we need to do and pray a lot that everything else keeps working, and that our tooling (things like Artifactory) saves us from supply-chain attacks. Or we just run piles of old, outdated code because "it works". I can't tell you how many microservices I have seen that are "just fine", but no one in the current org has ever read a line of what's in them, and the people who wrote them left ages ago.

> clarity too

Yes, but define clarity!

I recently had the pleasure of fixing a chunk of code that was part of a data pipeline. It was an if/elseif/elseif structure... where the final two branches were fairly benign and would have applied in 99 percent of cases. Everything else was there to deal with the edge cases!

I had an idea of where the issue was, but I didn't understand how the code ended up in the state it was in... Blame -> find the commit message (references a ticket) -> find the Jira ticket (references Salesforce) -> find the original customer issue in Salesforce, and read through the whole exchange there.

A two-line comment could have spared me all that work to get to what amounted to a dead simple fix. The code was absolutely clear, but without the "why" portion of the context I likely would have created some sort of regression that would have passed the good-enough testing that was there.
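
Something along these lines is all it would have taken (names, types, and ticket numbers entirely hypothetical, TypeScript just to sketch the shape):

    // Hypothetical reconstruction, purely to illustrate a "why" comment.
    type Invoice = {
      isLegacyEu: boolean;
      hasManualAdjustment: boolean;
      totalCents: number;
      adjustedTotal: number;
      total: number;
    };

    function invoiceTotal(invoice: Invoice): number {
      // WHY: legacy EU customers send totals in cents (JIRA-1234; the original
      // Salesforce case is linked from the ticket). The early branches handle
      // those edge cases; the final ones are the 99% path. Don't "simplify".
      if (invoice.isLegacyEu) return invoice.totalCents / 100;
      if (invoice.hasManualAdjustment) return invoice.adjustedTotal;
      return invoice.total;
    }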

I rewrote a portion of the code (expanding variable names); that code is now less "scannable" and more "readable" (different types of clarity). Dropped in comments: a few sentences of explanation, and references to the tickets. Went and updated the tests, with similar notes.

Meanwhile, elsewhere (other code base, other company), that same chain is broken... the "bug tracking system" that is referenced in the commit messages there no longer exists.

I have a friend who, every time he updates his dev environment, calls me to report that he "had to go update the wiki again!" because someone made a change and told everyone in a Slack message. Here is yet another vast repository of degrading, unsearchable, and unusable tribal knowledge embedded in so many organizations out there.

Don't even get me started on the project descriptions/goals/tasks that amount to pantomime and post-it notes, absent any sort of genuine description.

Lack of clarity is very much also a problem of missing context in situ.


I think humans are pretty good at it in small teams with the right structure. There are definitely dysfunctional orgs, as you describe, where humans produce garbage code, yes. I blame the org for that, not the humans.

As to what defines clarity: yes, of course, like the word "quality" it is very hard to define, but we can certainly recognise when it was not considered.

I think it is a goal worth striving for though, and abandoning code standards because we now have AI helpers is stupid and self-defeating, even if we think they are very capable and will improve.

The end of history has not, in fact, arrived with generative AI; we still have to maintain software afterwards.


Most of inner London doesn't need a car at all.

Now they have 13k lines of someone else’s mess (the AI’s) to manage instead.

But this is a different kind of problem.

With legacy systems, at least the complexity was somewhat anticipated early in the design process (even if the anticipation was wrong).

With automatically generated code, you get something that "works" but with a much vaguer underlying model, which makes it harder to understand when things start to go wrong.

In both cases, the real cost comes later, when you're forced to debug under pressure.

