Hacker News — fred_is_fred's comments

I'm not sure passive income is the right way to describe a drop shipping operation where you have to interact with customers. Passive income is $VT.

How does this compare to the commercial models like Sonnet 4.5 or GPT? Close enough that the price is right (free)?

They will not measure up. Notice they're comparing it to Gemma, Google's open-weight model, not to Gemini, Sonnet, or GPT. That's fine - this is a tiny model.

If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):

https://qwen.ai/blog?id=qwen3.6


On the bright side, it's also worth keeping in mind that these tiny models are better than the GPT-4, GPT-4.1, and GPT-4o models we used to enjoy less than 2 years ago [1]

[1] https://artificialanalysis.ai/?models=gpt-5-4%2Cgpt-oss-120b...


They're absolutely worth using for the right tasks. It's hard to go back to GPT4 level for everything (for me at least), but there's plenty of stuff they are smart enough for.

> Close enough

No. These are nowhere near SotA, no matter what the benchmark numbers say. They are amazing for what they are (runnable on regular PCs), and you can find use cases for them (where privacy >> speed / accuracy) where they perform "good enough", but they are not magic. They have limitations, and you need to adapt your workflows to handle them.


Can you share more about what adaptations you made when using smaller models?

I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.


You'd do most of the planning/cognition yourself, down to the module/method signature level, and then have the model loop through the plan to "fill in the code". You need a strong testing harness to loop effectively.
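A minimal sketch of that loop, assuming a hypothetical `generate` callable wrapping your local model and a `run_tests` harness (both names are placeholders, not a real API):

```python
def fill_in_code(plan, generate, run_tests, max_retries=3):
    """Loop over a human-written plan of stubs, asking a small model to
    fill each in, and retry with test feedback until the tests pass."""
    results = {}
    for name, signature in plan:
        prompt = f"Implement exactly this function:\n{signature}"
        for _ in range(max_retries):
            code = generate(prompt)
            ok, feedback = run_tests(name, code)
            if ok:
                results[name] = code
                break
            # Feed the failure back so the next attempt can correct it.
            prompt += f"\nThe previous attempt failed:\n{feedback}"
        else:
            results[name] = None  # give up; a human takes over
    return results
```

The key point is that the plan (the signatures) comes from you; the model only fills bodies and is checked mechanically on every iteration.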

General claims about a model are rarely useful; only very specific claims are, ones that state the exact parameter count and quantization method of each model being compared.

If you perform the inference locally, there is a huge space of compromise between the inference speed and the quality of the results.

Most open weights models are available in a variety of sizes. Thus you can choose anywhere from very small models with a little more than 1B parameters to very big models with over 750B parameters.

For a given model, you can choose to evaluate it at its native precision, which is normally BF16, or in a great variety of smaller quantized formats, in order to fit the model in less memory or simply to reduce memory-access time.
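Back-of-the-envelope arithmetic for that trade-off (the 1.2x overhead factor for KV cache and activations is a rough assumption):

```python
def model_memory_gb(params_billion, bits_per_weight, overhead=1.2):
    """Approximate memory needed to hold the weights at a given precision,
    with a rough multiplier for runtime overhead (KV cache, activations)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(model_memory_gb(70, 16))  # BF16 70B: ~168 GB, datacenter territory
print(model_memory_gb(70, 4))   # 4-bit 70B: ~42 GB, still too big for one consumer GPU
print(model_memory_gb(8, 4))    # 4-bit 8B: ~4.8 GB, fits a consumer card
```

This is why the same model name can mean wildly different things depending on size variant and quantization.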

Therefore, if you choose big models without quantization, you may obtain results very close to SOTA proprietary models.

If you choose models so small and so quantized as to run in the memory of a consumer GPU, then it is normal to get results much worse than with a SOTA model that is run on datacenter hardware.

Choosing to run models that do not fit inside the GPU memory reduces the inference speed a lot, and choosing models that do not fit even inside the CPU memory reduces the inference speed even more.

Nevertheless, slow inference that produces better results may reduce the overall time for completing a project, so one should do a lot of experiments to determine an appropriate compromise.

When you use your own hardware, you do not have to worry about token costs or subscription limits, which may change the optimal strategy for using a coding assistant. In many cases it may even be worthwhile to run multiple open-weights models on the same task and choose the best solution.

For example, when comparing older open-weights models with Mythos, with appropriate prompts all the bugs that Mythos could find could also be found by the older models. The difference was that Mythos found all the bugs on its own, while with the free models you had to run several of them to find all the bugs, because the models had different strengths and weaknesses.

(In other HN threads there have been some bogus claims that Mythos was somehow much smarter, but that does not appear to be true: the other company has provided the precise prompts used for finding the bugs, and it would not have been too difficult to generate them automatically with a harness. Anthropic has also admitted that the bugs found by Mythos had not been found by using a prompt like "find the bugs", but by running many times Mythos on each file with increasingly more specific prompts, until the final run that requested only a confirmation of the bug, not searching for it. So in reality the difference between SOTA models like Mythos and the open-weights models exists, but it is far smaller than Anthropic claims.)
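The "run several models and union their findings" strategy described above can be sketched like this (the `ask` callables are hypothetical wrappers around each local model, each returning a list of bug descriptions):

```python
def find_bugs_ensemble(file_text, models):
    """Union the bug reports from several open-weights models,
    since each model has different strengths and weaknesses."""
    bugs = set()
    for ask in models:
        bugs |= set(ask(f"List the bugs in this code:\n{file_text}"))
    return sorted(bugs)
```

On local hardware the extra runs cost only time, not per-token fees, which is what makes this strategy viable at all.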


Thank you, I've been doing guided exploration of the various quantized models with the help of Gemini (which is highly ironic, but effective).

It does seem like 16GB is at the extreme lower end of being able to produce capable results; it's very much like a junior dev, so a lot of oversight is needed.

A tight code-test-fix loop seems to be the way forward.


> Anthropic has also admitted that the bugs found by Mythos had not been found by using a prompt like "find the bugs", but by running many times Mythos on each file with increasingly more specific prompts, until the final run that requested only a confirmation of the bug, not searching for it.

Unless there's been more information since their original post (https://red.anthropic.com/2026/mythos-preview/), this is a misleading description of the scaffold. The process was:

- provide a container with running software and its source code

- prompt Mythos to prioritize source files based on the likelihood they contain vulnerabilities

- use this prioritization to prompt parallel agents to look for and verify vulnerabilities, focusing on but not limited to a single seed file

- as a final validation step, have another instance evaluate the validity and interestingness of the resulting bug reports

This amounts to at most three invocations of the model for each file, once for prioritization, once for the main vulnerability run, and once for the final check. The prompts only became more specific as a result of information the model itself produced, not any external process injecting additional information.
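Under that reading, the scaffold boils down to something like this sketch (`model` is a hypothetical callable standing in for a Mythos invocation; the real scaffold is agentic and more involved):

```python
def audit_repo(files, model):
    """At most three invocations per file: prioritize, search/verify, validate."""
    # 1. One call ranks source files by likelihood of containing vulnerabilities.
    ranked = model(f"Rank these files by vulnerability likelihood: {files}")
    # 2. One main run per seed file looks for and verifies issues,
    #    focusing on but not limited to that file.
    reports = [model(f"Find and verify vulnerabilities, starting from {seed}")
               for seed in ranked]
    # 3. One final call judges the validity and interestingness of each report.
    return [r for r in reports
            if model(f"Is this report valid and interesting? {r}") == "yes"]
```

Each later prompt only carries information an earlier model call produced, which is the distinction being drawn above.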


I think it's worth noting that if you are paying for electricity, a local LLM is NOT free. In most cases you will find that Haiku is cheaper, faster, and better than anything that will run on your local machine.

Electricity (in the continental US) is pretty cheap, assuming you already have the hardware:

Running at a full load of 1000W for every second of the year, at 16 cents per kWh, costs about $1,400 USD (8,760 kWh). A model producing 100 tps would generate roughly 3.15 billion tokens in that time.

The same amount of tokens would cost at least $3,150 USD on current Claude Haiku 3.5 pricing.
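Redoing that arithmetic (the $1 per million tokens for Haiku 3.5 is an assumed input-token rate; output tokens cost more, hence "at least"):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600        # 31,536,000 s

# Local: 1 kW drawn continuously, at $0.16/kWh.
kwh = 1.0 * SECONDS_PER_YEAR / 3600       # 8,760 kWh
electricity_cost = kwh * 0.16             # ~$1,402

# Cloud: the tokens a 100 tps machine emits in that year, priced per million.
tokens = 100 * SECONDS_PER_YEAR           # ~3.15 billion tokens
api_cost = tokens / 1e6 * 1.00            # ~$3,154 at the assumed $1/M
```

The comparison flips, of course, if the API price per million tokens drops below the local electricity cost per million tokens (~$0.44 here).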


This 35B-A3B model is 4-5x cheaper than Haiku, though, suggesting it would still be cheaper to outsource inference to the cloud than to run locally in your example.

If you need the heating then it is basically free.

Only if you use resistive electric heating, which is usually the most expensive heating available.

They were basically bankrupt 2 weeks ago and sold for $39M. The stock pop shows me that we're in a bubble of irrational exuberance. And what are they going to do with $50M? Buy a few racks of servers?

Can someone take this and build a working 6502 binary that creates a SQL database?

Chris has been so focused for 5 years that he had no clue about anything else going on in the industry?

It reads to me like the author thinks Chris is an idiot for not selling his automation tech to the defense industry

Makes me wonder if a respiratory epidemic that killed people could also have a destabilizing impact on human society. I guess we may never know.

>If you cut out the vulnerable code from Heartbleed and just put it in front of a C programmer, they will immediately flag it. It's obvious.

Genuinely curious - why couldn't a static analyzer also find the issue, then? Those have been worked on for 50+ years at this point.


The whole idea of someone's code being perfectly handcrafted may have been true in 1998, but any project you start now builds on a tower of open-source libraries, frameworks, and container images - probably running on someone else's infra. Nobody is really starting from a blank page anymore.

Is it down? The start and skip buttons both don't work, and I see this error in my console:

Manifest fetch from https://www.convexly.app/manifest.json failed, code 403


Just checked and everything is up. That might just be a console warning, but it shouldn't affect the quiz. Can you try a hard refresh (ctrl+shift+R)? If that still doesn't work, what browser are you on?

I tried Chrome and Safari. It's working great on my phone, so it's probably Zscaler.

For sure. Zscaler can block certain requests. Glad it works on your phone!

He says "trust your gut" about 12 times, but the whole lead-up has zero mention that he was worried he would not get paid. His only gut feelings seemed to be about tech issues.

Yes, and I guess also trust your gut when the situation looks completely wrong. Your gut is not lying when it sees that, either.
