Hacker News | jpcompartir's comments

Anthropic releases used to feel thorough and well done, with the models feeling immaculately polished. It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.

Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC and the Desktop App, and with unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.

I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.


Boris from the Claude Code team here. We agree, and will be spending the next few weeks increasing our investment in polish, quality, and reliability. Please keep the feedback coming.

> investment in polish, quality, and reliability

For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc. on the backend.

A better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for a given quality of output, though.


Sure, I've cancelled my Max 20 subscription because you guys prioritize cutting your costs/increasing token efficiency over model performance. I use expensive frontier labs to get the absolute best performance, else I'd use an Open Source/Chinese one.

Frontier LLMs still suck a lot, you can't afford planned degradation yet.


My biggest problem with CC as a harness is that I can't trust "Plan" mode. Long running sessions frequently start bypassing plan mode and executing, updating files and stuff, without permission, while still in plan mode. And the only recovery seems to be to quit and reload CC.

Right now my solution is to run CC in tmux and keep a 2nd CC pane with /loop watching the first pane and killing CC if it detects plan mode being bypassed. Burning tokens to work around a bug.


Here's one person's feedback. After the release of 4.7, Claude became unusable for me in two ways: frequent API timeouts when using exactly the same prompts in Claude Code that I had run problem-free many times previously, and absurdly slow interface response in Claude Cowork. I found a solution to the first after a few days (add "CLAUDE_STREAM_IDLE_TIMEOUT_MS": "600000" to settings.json), but as of a few hours ago Cowork--which I had thought was fantastic, by the way--was still unusable despite various attempts to fix it with cache clearing and other hacks I found on the web.
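For reference, the settings.json workaround mentioned above would look something like this; nesting the variable under an `env` key is my assumption about where Claude Code reads environment overrides, so adjust to however your settings file is laid out:

```json
{
  "env": {
    "CLAUDE_STREAM_IDLE_TIMEOUT_MS": "600000"
  }
}
```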

Hm. ML people love static evals and such, but have you considered the approaches that typically appear in SaaS: slow rollouts, org/user-constrained testing pools with staged rollouts, real-world feedback from actual usage data (where the privacy policy permits)?


> Please keep the feedback coming

if only there were a place with 9,881 pieces of feedback waiting to be triaged...

And maybe not just by a duplicate-bot that goes wild and autocloses everything; even blessing some of the stuff there with a "you've been seen" label would go a long way...


Common pattern when checking the Claude Code issue tracker for a bug: land on issue #12587, auto-closed as a duplicate of #12043; check #12043, auto-closed as a duplicate of #11657; check #11657, auto-closed as a duplicate of #10645; check #10645, never got a response, or closed as not planned, or some other bullshit.

I am considering providing my feedback by not providing my money any longer.

Why ban third party wrappers? All of this could've been sidestepped had you not banned them.

Because then they lose vertical integration and the extra ability it grants to tune settings to reduce costs / token use / response time for subscription users.

Or improve performance and efficiency, if we’re generous and give them the benefit of the doubt.

It makes sense, in a way. It means the subscription deal is something along the lines of fixed / predictable price in exchange for Anthropic controlling usage patterns, scheduling, throttling (quotas consumptions), defaults, and effective workload shape (system prompt, caching) in whatever way best optimises the system for them (or us if, again, we’re feeling generous) / makes the deal sustainable for them.

It’s a trade-off.


They gained that ability to tune settings and then promptly used it in a poor way and degraded customer experience.

That’s what we see.

It may be (but I wouldn’t know) that some of other changes not covered here reduced costs on their side without impacting users, improving the viability of their subscription model. Or maybe even improved things for users.

I’d really appreciate more transparency on this, and not just when things fail.

But I’ve learned my lesson. I’ve been weaning off Claude for a few weeks, cancelled my subscription three weeks ago, let it expire yesterday, and moved to both another provider and a third-party open-source harness.


Nothing you wrote makes sense. The limits are there so Anthropic isn't operating at a loss. If they can customize Claude via Code, I see no reason why they couldn't do so with other wrappers. Other wrappers can also make use of the cache.

If you worry about "degraded" experience, then let people choose. People won't be using other wrappers if they turn out to be bad. People ain't stupid.


By imposing the use of their harness, they control the system prompt:

> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7

They can pick the default reasoning effort:

> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode

They can decide what to keep and what to throw out (beyond simple token caching):

> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6

It literally is all in the post.

I don't worry about anything though. It's not my product. I don't work for Anthropic, so I really couldn't care less about anyone else's degraded (or not) experience.


> they control the system prompt

They control the default system prompt. You can change it if you want to.

> They can pick the default reasoning effort

Don't see how it's an obstacle in allowing third party wrappers.

> They can decide what to keep and what to throw out

That's actually a good point. However I still don't think it's an obstacle. If third party wrappers were bad, people simply wouldn't be using them.


Evidently, all these things you just dismissed matter, else all the changes I quoted from the original post wouldn’t have affected anyone, or half as many people, or half as much. Anthropic wouldn’t have had any complaints to investigate, the article promoting this entire thread wouldn’t exist, and we wouldn’t be having this very conversation.

Defaults matter. A large share of people never change them (status quo bias, psychological inertia). Having control over them (and usage quotas) means Anthropic can control and fine-tune what this fixed subscription costs them.

And evidently (re, the original article), they tried to do so.


Edit: the article prompting this entire thread.

> Defaults matter. A large share of people never change them (status quo bias, psychological inertia). Having control over them (and usage quotas) means Anthropic can control and fine-tune what this fixed subscription costs them.

Allowing third party wrappers doesn't mean Claude Code would cease to exist. The opposite actually, Claude Code would be the default.

People dissatisfied with Code would simply use other wrappers. I call it a win-win. I don't see how Anthropic would be at a loss here; they would still retain the ability to control the defaults.


Except one of the major other wrappers was pi, through OpenClaw. With countless hundreds of thousands of instances running every hour on that heartbeat.

I have no idea what the share of OpenClaw instances running on pi was, or third-party wrappers in general, but it was obviously large enough that Anthropic decided they had to put an end to it.

Conversely, from the latest developments, it would seem they are perfectly fine with people running OpenClaw with Claude models through Claude Code’s programmatic interface using subscriptions.

But in the end, this, my take, your take, is all conjecture. We are both on the outside looking in.

Only the people who work at Anthropic know.


And you didn't invest anything in polish, quality and reliability before... why? Because for any questions people have you reply something like "I have Claude working on this right now" and have no idea what's happening in the code?

A reminder: your vibe-coded slop required a peak of 68GB of RAM, and you had to hire actual engineers to fix it.


I think you're being a bit harsh.

... But then again, many of us are paying out of pocket $100, $200USD a month.

Far more than any other development tools.

Services that cost that much money generally come with expectations.


Here's Jarred Sumner of Bun saying they reduced peak consumption from 68GB to 1.7GB: https://x.com/jarredsumner/status/2026497606575398987 Anthropic had acquired Bun just 3 months prior.

A month prior, their vibe-coders were unironically telling the world that their TUI wrapper for their own API is a "tiny game engine", while they were (and still are) struggling to output a couple hundred characters on screen: https://x.com/trq212/status/2014051501786931427

Meanwhile Boris: "Claude fixes most bugs by itself", while breaking the most trivial functionality all the time: https://x.com/bcherny/status/2030035457179013235 https://x.com/bcherny/status/2021710137170481431 https://x.com/bcherny/status/2046671919261569477 https://x.com/bcherny/status/2040210209411678369 while claiming they "test carefully": https://x.com/bcherny/status/2024152178273989085


Yeah, you don't have to convince me. I switched to Codex mid-January, in part because of the dubious quality of the TUI itself and the unreliability of the model. Briefly switched back through March, and yep, still a mistake.

Once OpenAI added the $100 plan, it was kind of a no-brainer.


I've noticed the same thing in my own AI-assisted work. It feels like I'm moving too fast: it's easy to implement decisions quickly, but they really have to be the right f--ing decisions. In the past, dev was slow enough that you had a lot of time to vet the hard decisions; now you don't.

> It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.

I don't know, their desktop app felt really laggy and even switching Code sessions took a few seconds of nothing happening. Since the latest redesign, however, it's way better, snappy and just more usable in most respects.

I just think we notice the disruptive, negative things more. Even with the desktop app, the remaining flaws jump out: for example, the Chat / Cowork / Code switcher only shows the label for the currently selected mode while the others are just (not very big) icons; a colleague literally didn't notice those modes were in the desktop app (or at least that that's where you switch between them).


Given the price, I don't really think they're the best option. They're sloppy and competitors are catching up. I'm getting the same results with other models, and very close with Kimi, which is waaay cheaper.

I agree. It all feels so AI-slopy now.

I guess it's a bit of desperation to find a sustainable business model.

The AI hype is dying, at least outside the silicon valley bubble which hackernews is very much a part of.

That and all the dogfooding by slop coding their user facing application(s).


Likewise, I foolishly assumed everybody else was just doing it wrong.

But this week I've lost count of the times I've had to say something along the lines of: "Can you check our plan/instructions, I'm pretty sure I said we need to do [this thing] but you've done [that thing]..."

And get hit with a "You're absolutely right...", which virtually never happened for me before. I think maybe once since Opus 4.6.


Honestly, I thought it was a skill issue too, but it just turns out I wasn't using it enough.

I started a new job recently, so I'm asking it a lot of questions about the codebase, sometimes just to confirm my understanding and often it came up with wrong conclusions that would send me down rabbit holes only to find out it was wrong.

On a side project, I gave it literally a formula and told it to run it with some other parameters. It did its usual "let me get to know the codebase", then an "I have a good understanding of the codebase" speech, only to follow up with "what you're asking is not possible". I'm like... no, I know it's possible, I implemented it already, just use it in more places. Only to get the same "oh yeah, you're right, I missed that... blabla".

Yeah, it's gotten pretty bad...


Maybe a consequence of saving GPU for newer models? Also, tuning the effort level is supposed to help; I haven't got enough data points on this though.

They track our frustration, which is probably really good coding data. The reason it's painful is that this is data annotation: literally a job people get paid to do, yet we're paying to do it. If they need good data, they just turn the models to shit and gaslight everyone.

This looks like a Claude-generated SVG to me, is it not?


It's 100% claude-generated html. I asked it to create some other cheat sheet for me and the template was identical.

Edit: https://news.ycombinator.com/item?id=47495528


There are better techniques for hyper-parameter optimisation, right? I fear I have missed something important: why has Autoresearch blown up so much?

The bottleneck in AI/ML/DL is always data (volume & quality) or compute.

Does/can Autoresearch help improve large-scale datasets? Is it more compute-efficient than humans?


There is a field of AutoML, with its own specialized academic literature and libraries, that tried to achieve this type of thing but didn't work very well in practice.

Years ago there were big hopes for Bayesian hyperparameter optimization (predicting performance with Gaussian processes, the hyperopt library, etc.), but it often started wasteful experiments because it really had no idea what the parameters did. People mostly just do grid search and random search, with a configuration set up by intuition and experience. Meanwhile an LLM can see what each hyperparameter does, see what techniques and settings have worked in the literature, and do something approximating common sense about what has a big enough effect. It's surprisingly difficult to precisely define when a training curve has really flattened, for example.
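For anyone curious what the non-LLM baseline looks like, random search is only a few lines. This is a toy sketch: the objective function here is invented (a real "trial" would be a full training run), as are the parameter ranges.

```python
import random

def train_and_eval(lr, batch_size):
    # Hypothetical stand-in for an expensive training run that returns a
    # validation loss; minimum is around lr=1e-3, batch_size=64.
    return (lr - 1e-3) ** 2 * 1e6 + abs(batch_size - 64) / 64

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # Sample the learning rate log-uniformly, batch size from a
        # discrete set; the search space itself is pure intuition.
        lr = 10 ** rng.uniform(-5, -1)
        batch = rng.choice([16, 32, 64, 128, 256])
        loss = train_and_eval(lr, batch)
        if best is None or loss < best[0]:
            best = (loss, lr, batch)
    return best

loss, lr, batch = random_search()
```

The point of the comment above is that this loop has no idea what `lr` means; an LLM proposing the next trial can at least bring priors from the literature.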

So in theory there are many non-LLM approaches but they are not great. Maybe this is also not so great yet. But maybe it will be.


AFAIK, it's a bit more than hyper-parameter tuning as it can also make non-parametric (structural) changes.

Non-parametric optimization is not a new idea. I guess the hype is partly because people hope it will be less brute force now.


It's an LLM-powered evolutionary algorithm.


I'd like to see a system like this take more inspiration from the ES literature, similar to AlphaEvolve. Let's see an archive of solutions, novelty scoring, and some crossover, rather than purely mutating the same file in a linear fashion.
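A minimal sketch of the archive idea, in the MAP-Elites style: instead of mutating one candidate linearly, keep an archive keyed by a behavior descriptor and only replace a cell when a child beats it. Everything here is a toy: genomes are number lists standing in for program variants, the fitness and descriptor functions are invented, and "novelty" is approximated by the descriptor binning rather than an explicit novelty score.

```python
import random

def fitness(g):
    # Made-up quality measure; optimum when every gene is 0.5.
    return -sum((x - 0.5) ** 2 for x in g)

def descriptor(g):
    # Which archive cell a genome falls into (bins on the first gene).
    return round(g[0], 1)

def crossover(a, b, rng):
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(g, rng):
    return [min(1.0, max(0.0, x + rng.gauss(0, 0.1))) for x in g]

def evolve(generations=200, seed=0):
    rng = random.Random(seed)
    archive = {}  # descriptor cell -> (fitness, genome)
    for _ in range(generations):
        if len(archive) >= 2:
            # Pick two archived parents, recombine, then mutate.
            a, b = rng.sample([g for _, g in archive.values()], 2)
            child = mutate(crossover(a, b, rng), rng)
        else:
            child = [rng.random() for _ in range(4)]
        cell, f = descriptor(child), fitness(child)
        # Keep the child only if its cell is empty or it beats the incumbent.
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, child)
    return archive

archive = evolve()
```

In an AlphaEvolve-style setup, `mutate`/`crossover` would be LLM proposals over source files and `fitness` an actual evaluation harness; the archive mechanics stay the same.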


Exactly, that's the way forward.

There are lots of old ideas from evolutionary search worth revisiting given that LLMs can make smarter proposals.


That was my impression, including the evolutionary programming, which would normally happen at the AST level; with an LLM it can happen at the source level.


Perhaps LLM-guided Superoptimization: <https://en.wikipedia.org/wiki/Superoptimization>

I recall reading about a stochastic one years ago: <https://github.com/StanfordPL/stoke>


> There are better techniques for hyper-parameter optimisation, right?

Yes, for example "swarm optimization".

The difference with "autoresearch" (restricting just to the HPO angle) is that the LLM may (at least we hope) beat conventional algorithmic optimization by making better guesses for each trial.

For example, perhaps the problem has an optimization manifold that has been studied in the past and the LLM either has that study in its training set or finds it from a search and learns the relative importance of all the HP axes. Given that, it "knows" not to vary the unimportant axes much and focus on varying the important ones. Someone else did the hard work to understand the problem in the past and the LLM exploits that (again, we may hope).


> The bottleneck in AI/ML/DL is always data (volume & quality) or compute.

Not true at all. The whole point of ML is to find better mappings from X to Y, even for the same X.

Many benchmarks can’t be solved by just throwing more compute at the problem. They need to learn better functions which traditionally requires humans.

And sometimes an algorithm lets you tap into more data. For example transformers had better parallelism than LSTMs -> better compute efficiency.


Fair pushback, but I do think the LSTM-vs-Transformers point kinda supports my position in the limit rather than refuting it. Once the compute bottleneck is removed, LSTMs scale favourably: https://arxiv.org/pdf/2510.02228 (I believe there's similar work on vanilla LSTMs, but I'd have to go digging)

So the bottleneck was compute, which is compatible with 'data or compute'. But to grant your point: at the time, the algorithmic advances were useful, and they did unlock/remove the bottleneck.

A wider point is that eventually (once compute and data are scaled enough) the algorithms are all learning the same representations: https://arxiv.org/pdf/2405.07987

And of course the canon: https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dat... http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Scaling compute & data > algorithmic cleverness


Algorithms do matter because compute is not unlimited in practice. Otherwise we might as well use bogo sort because the result is eventually the same. Yes the platonic ideal of a sorted list looks the same but that doesn’t tell you anything about how to get there or whether you can in this lifetime.

I bring up transformers because scaling compute and data was unlocked by a better algorithm. It matters a lot because scaling compute isn’t always an option.


> There are better techniques for hyper-parameter optimisation, right?

There always are. You need to think about what those would be, though. Autoresearch outsources the thinking to LLMs.


"Regardless, these threats do not change our position: we cannot in good conscience accede to their request."


Yes, that is great for people from the US. For people in Europe and other locations, this just proves that they don't really care, as the tool is already being used against us. It's quite clear to me that anyone outside the US should immediately cancel all contracts with these corporations, as well as work their hardest at blocking their bots online.


As a non-US citizen, I'm quite glad in the knowledge that Claude won't be used to kill other non-US citizens with autonomous weapons


This is great, brings clear benefits to both sides and the rest of us.

Always rooting for Hugging Face


Yep, Gemini is virtually unusable compared to Anthropic models. I get it for free with work and use maybe once a week, if that. They really need to fix the instruction following.


Thanks for the long and considered response, but this is a really ugly UX decision.

As others have said, 'reading 10 files' is useless information: we want to be able to see at a glance where it is and what it's doing, so that we can redirect if necessary.

With the release of Cowork, couldn't Claude Code double down on needs of engineers?


This is great. Not 10 minutes before this outage, I had presented Railway as a viable option for some small-scale hosting of prototypes and non-critical apps, as an alternative to the cloud giants.


It always happens that way. I guarantee some people migrated from Heroku to Railway and bragged about future stability to the team, only to experience this.


Yeah 100%

This won't change my decision, but it is still impeccable timing


4.6 is a beast.

Everything in plan mode first + AskUserQuestionTool, review all plans, get it to write its own CLAUDE.md for coding standards and edit where necessary and away you go.

Seems noticeably better than 4.5 at keeping the codebase slim. Obviously it still needs an eye kept on it, but it's a step up from 4.5.


Not clearly a step up for me; it seems way more hesitant, and I don't notice the context being any larger; it seems to compact just as often.

