Hacker Newsnew | past | comments | ask | show | jobs | submit | coder543's commentslogin

> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.

The Gemma 3 QAT report was a bit clearer:

https://developers.googleblog.com/en/gemma-3-quantized-aware...

"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."

The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.


Are there evidence that this approach helps maintain "accuracy" performance when quantized? It sounds a bit like mxfp4 with gpt-oss, which was a confusing model upon release.

I have just been humbled by the Gemma 4 26B QAT build (unsloth's version), which insisted repeatedly that I am wrong in my requirements for some niche wordpress code, which cannot be satisfied.

I am a good WP developer so I kept prodding it and it kept insisting, and it explained with clarity. Turns out it is right and I was wrong, as I would have found out if I'd written the code myself.

I've been using this particular test for days, experimenting in ways to generate and prompt code. The 4-bit quantisation of the pre-QAT model does not catch this error. And nor can the Qwen 3.6 sparse model, which confidently blazed past it and never mentioned it.

(FWIW neither did plain ChatGPT; maybe Codex would)

Anecdotal, but there you go. I am somewhat weirded out by it.


So what we want now is unsloth (or anyone) to release 4/6-bit quantized models of these releases?

Yep, Unsloth already did, as linked in the comment at the top of this thread

Closed source?

IntelliJ: https://github.com/JetBrains/intellij-community

PyCharm: https://github.com/JetBrains/intellij-community/tree/master/...

Android Studio: https://android.googlesource.com/platform/tools/adt/idea/+/r...

Yes, they might offer extended proprietary editions/plugins in addition, but the IDEs themselves are open source.


Oh, this is great!

I've filed bugs with JetBrains before and had them take months getting to my ticket, often with multiple hand-offs between team members; being able to provide a potential fix should make the process much faster.


None of these are the "norm". The IDEs OC mentioned all have a much larger install base.


Do all of those installed on my various machines for the express purpose of a last resort of building some obscure crap about once a couple years count? Because of course I have them installed.. somewhere. And of course I wouldn't imagine using that crap daily.


[citation needed]

IDEs made by JetBrains are huge. At this point, they're basically the standard option for several JVM languages.


You can't expect the average person on HN to admit to using a JVM-based language. That would mean they write boring business software rather than cool ad surveillance tech.


I'm always taken aback a little when I read through HN and see how little mind share Kotlin and its ecosystem has here. JetBrains has done a pretty good job of creating something that can fill many different niches (especially considering they're not one of the giant tech companies with virtually unlimited budgets), but it seems people don't even realize it exists, for whatever reason. It doesn't even need to run on a JVM in many cases, if that's some sort of barrier.


The computer pops up a warning if you plug a fast device into the slow port, which is a lot more informative for the average user than a tiny label that most users wouldn’t even read.

Labels would be nice, I guess, but their absence is hardly a dealbreaker.


Windows has been showing popup USB speed warnings since at least Windows XP.... so 25 years?

Let's not use this cope to mislead anyone into thinking this is a unique Mac innovation (it isn't) that trumps this abomination of human factors (it doesn't).


I have never ever seen Windows provide this warning even once just because there is a faster port on the machine and the user plugged the device into the wrong one. Please provide a source for this claim that you are making. Citation absolutely needed.

In the unlikely case that this feature exists thanks to Microsoft, I would like to say that is great, because it is much more user friendly than only having tiny labels. But since I’ve never seen this feature work before, it seems to me that it must be broken, if it exists at all.


The OP might be remembering this (link) style of message from the Windows XP days. I don't think I've seen it for windows 7/10/11, sadly.

It will warn you if you're charging over the slower to charge port, though.

https://superuser.com/questions/1022542/windows-10-display-a...


These warning messages do exist, at least for if your computer supports USB4 but not on that port, or thunderbolt / DP alternate mode but not on that port

https://learn.microsoft.com/en-us/windows-hardware/drivers/u...


MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.


From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.


That's great news. That has not been the case with other MTP implementations like Qwen3.5, but I see the section in the article saying Google introduced some architectural optimizations to make this possible.


Your “benchmark” is invalid. Penalizing the model because the hosting environment is being DDoSed by users a few hours after launch is utter nonsense.

I see that you tried to justify this lower in the thread, but no… it completely invalidates your benchmark. You are not testing the model. You are conflating one specific model host and model performance, and then claiming you are benchmarking the model. All major models are hosted by multiple different services.

In the real world, clients will just retry if there is a server error, and that will not impact response quality at all, and the workflow the model is being used in will not fail. If a workflow is so poorly coded that it doesn’t even have retry logic, then that workflow is doomed no matter which host you use. But again, reliability of the host is separate from the model.

You can make your benchmark valid by having separate leaderboards for model quality and host reliability. I’m not saying to throw the whole thing away. But the current claim is not valid.

And you’re also making an unsourced claim that everyone else has already determined this model sucks? Nah. The first result from Artificial Analysis shows good things: https://x.com/ArtificialAnlys/status/2047547434809880611

But I am still waiting to see the results from the full suite of AA benchmarks.


Their benchmark is full of nonsense like this and I'm amazed the fact most of their interactions on the site are promoting it hasn't gotten the account banned for spam.

They have Gemini 2.5 Flash ahead of Opus 4.6: https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Absolutely worthless benchmark but every release has a comment linking to this nonsense.


The description specifically says:

"Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking."


From the page:

> Import from anywhere. Start from a text prompt, upload images and documents (DOCX, PPTX, XLSX), or point Claude at your codebase. You can also use the web capture tool to grab elements directly from your website so prototypes look like the real product.


Thank you, I should RTFA next time.


Artificial Analysis hasn't posted their independent analysis of Qwen3.6 35B A3B yet, but Alibaba's benchmarks paint it as being on par with Qwen3.5 27B (or better in some cases).

Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.

https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...

No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.

Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.


Qwen models commonly get accused of benchmaxxing though. Just something to keep in mind when weighing the standard benchmarks.


Every model release gets accused of that, including the flagship models.


Less so for Gemma-4 because it falls behind Qwen on benchmarks. Tests for benchmaxxing are also strongly suggestive: https://x.com/bnjmn_marie/status/2041540879165403527


No… seriously. Every model release is accused. Including Opus, GPT-5.4, whatever. And yes, including smaller models that are not the top in every benchmark.

My own experiences with Gemma 4 have been quite mediocre: https://www.reddit.com/r/LocalLLaMA/comments/1sn3izh/comment...

I would almost be tempted to call it benchmaxed if that term weren’t such a joke at this point. It is a deeply unserious term these days.

Gemma 4 is worse than its benchmarks show in terms of agentic workflows. The Qwen3.x models are much better; not benchmaxed. I have tested this extensively for my own workflows. Google really needs to release Gemma 4.1 ASAP. I really hope they’re not planning to just wait another calendar year like they did for Gemma 3 -> 4 with no intermediate updates.

And the lead author on the paper replied to that tweet to say that the scores would need to be greater than 80 to show actual contamination: https://x.com/MiZawalski/status/2043990236317851944?s=20


Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at speeds that are decent. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.


I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.


Running llama-server (it belongs to llama.cpp) starts a HTTP server on a specified port.

You can connect to that port with any browser, for chat.

Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.


That is an extremely strange article, in my opinion. They test Gemma 4 31B, but they use Qwen3 32B, DeepSeek R1, and Kimi K2, which are all outdated models whose replacements were released long before Gemma 4? Qwen3.5 27B would have done far better on these tests than Qwen3 32B, and the same for DeepSeek V3.2 and Kimi K2.5. Not to mention the obvious absence of GLM-5.1, which is the leading open weight model right now.

The article also seems to brush over the discovery phase, which seems very important. If it were as easy as they say, then the models should have been let loose and we would see if they actually found these bugs, and how many false positives they marked as critical. Instead, they pointed the models at the flawed code directly.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: