I just tested this on a bug fixing benchmark I'm working on. It did not perform ...

walrus01 · 2026-05-30T01:40:03 1780105203

I personally find any model smaller than something like Qwen 3.6 35B-A3B (8-bit quantization, about 49GB memory usage when loaded into llama.cpp) to be too "stupid" for reliable use.

I would much rather not run the model on my local laptop hardware and offload that to some system sitting under my desk in my home office, accessible via VPN, than take the risk of using an unreliable and flaky tool for the convenience of having it on the same hardware on my lap.

I pay very little attention to 8 billion or whatever (or even much smaller) models these days and I don't feel like I'm missing much.

satvikpendem · 2026-05-30T02:31:05 1780108265

Qwen 3.6 27B dense is much better than the 35B MoE model for coding, not sure if you've tried that yet.

walrus01 · 2026-05-30T02:33:37 1780108417

yes, I have, I use both. 27B slower in tok/s due to density, obviously, 35B-A3B for speed on simpler tasks.

intothemild · 2026-05-30T09:59:49 1780135189

You should enable MTP now that its available.

LLamaCPP has had some massive updates in the last week or so.

npodbielski · 2026-05-30T14:05:01 1780149901

Yes, Qwen 3.6 MoE is hitting like 80-90tk/s on Strix halo. On R9700 I had like 170t/s. It was not possible to keep up. But MoE is circling very often. I switch then to dense model and have 20-30t/s but it is able to solve quite a lot of tasks.

alfiedotwtf · 2026-05-30T17:45:38 1780163138

For those speeds, I’m assuming Q4?

npodbielski · 2026-05-31T06:05:05 1780207505

Ud_Q4_k_xl

intothemild · 2026-05-30T14:32:00 1780151520

I get 50-60t/s tg on my r9700 with the dense, unsloth MTP quant UD-Q5_K_XL, K@8/V@4 256k context.

Using Vulkan backend.

``` llama-server -fa on -t 7 -ngl 999 --mlock --fit off --kv-offload --no-webui --metrics --chat-template-kwargs {"preserve_thinking": true} -b 2048 -ub 1024 -m /mnt/models/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q5_K_XL.gguf --mmproj /mnt/models/unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf -c 262144 --kv-unified -ctk q8_0 -ctv q4_0 --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-ngl 99 --alias unsloth/Qwen3.6-27B-MTP-GGUF --temp 0.60 --top-k 20 --top-p 0.95 --min-p 0.00 --presence-penalty 0.00 --repeat-penalty 1.00 ```

sheeshkebab · 2026-05-30T15:40:21 1780155621

27b is slow as molasses vs 35b on local stuff I have (m5 max). Mtp doesn’t make any difference either.

theanonymousone · 2026-05-30T06:35:02 1780122902

Have you seen the 8bit quantisation matter a lot? The "consensus" in r/LocalLlama is that up to 4 bits the loss is tolerable.

walrus01 · 2026-05-30T06:55:50 1780124150

Absolutely. Difference in Q6 vs Q8 is not as immediately noticeable, but if I test by starting from a blank slate context and giving it the same complicated task with Q4 vs a Q8 GGUF file loaded, the difference is apparent. The Q4 will struggle or do 'stupid' things with even simple bash or python. Q4 might not be as noticeable for conversational purely text one on one interaction with an LLM, but when you dig deeper into something that's more esoteric in a training dataset than a chat conversation, absolutely a big gap there.

I think some of the folks in the local llm social media communities are using them for things like company-hosted customer service chat bots, or purely english text writing stuff where Q4 will probably not cause a problem. For more discrete technical work I stick pretty much exclusively to Q8.

theanonymousone · 2026-05-30T09:45:49 1780134349

Thanks a lot. How about Q8 vs FP16/BF16? Have you checked them too?

walrus01 · 2026-05-30T21:32:38 1780176758

I have not spent a lot of time running FP16 'full precision' versions of some things, but as the other commenter says, it's not much difference. There's a really wide array of benchmarks and tests from a lot of third parties unrelated to the trainer of the AI models that shows at most a two percent difference in score and capability between BF16 and Q8.

bradfa · 2026-05-30T12:35:33 1780144533

Q8 quant is very minimal fall off in terms of KLD against the lab 16 bit. If you have the memory for BF16 KV-cache (which is usually easier to stomach) then the Q8 is very close. But even Q8 quant model with Q8 KV-cache is very close.

Smaller quants for the model start to fall off but more importantly, smaller KV-cache quants fall off much faster so avoid less than Q8 there.

alfiedotwtf · 2026-05-30T17:47:28 1780163248

It’s not a general rule, and depends highly on the model and the quantisation used. Don’t guess, Unsloth sometimes publish graphs in their tutorials showing the error rate vs file size… sometimes Q4 is great, other times I go for Q6

thot_experiment · 2026-05-30T02:39:05 1780108745

q6 is fine for that qwen with ctx @ q8, and the dense models of that size are solid at q4 with q8 ctx

debazel · 2026-05-30T00:17:03 1780100223

I tried it with OpenCode and it is borderline incapable of using tool calls, so that might be why it is doing so bad on your test.

peder · 2026-05-30T00:48:43 1780102123

I just did the same. Absolutely awful. I assume OpenCode's heavy context is a problem, and it's probably better to use Liquid's own OpenCode alternative for this.

solarkraft · 2026-05-30T09:24:03 1780133043

Where can I find that agent harness? A look at their Docs and asking Gemini yielded no results.

Edit: Is it this? https://github.com/Liquid4All/cookbook/tree/main/examples/lo...

FYI: Opencode is very well tuned for Qwen models, but I haven’t found it that rare for niche models to perform badly in it.

h14h · 2026-05-30T15:07:43 1780153663

That's not all that surprising, IMO. From what I understand, LiquidAI is focusing pretty narrowly on building models that operate as the "agentic core" of a larger system.

If I were going to use this model, I'd be looking to use it more as is the primary chat interface of a larger system, and having it orchestrate & delegate tasks to other places via tool calls. It's not quite as exciting on the surface as a local "do it all" model, but it does enable some pretty neat use-cases, IMO.

I'm imagining a local agent that is super low latency, works entirely offline, and capable of queuing up complex tasks for larger/smarter cloud agents which execute them asynchronously.

onlyrealcuzzo · 2026-05-30T18:13:11 1780164791

Interesting...

Two of the other responses speak about it being abysmal at tool calling.

Overall, I'm pretty impressed a model this small can find/fix ~12% of bugs with crappy context - even if they're about as easy as possible to fix.

I just assumed it would perform better, given all the advancements in the space.

It's possible 1B active parameters is just not enough - even if it has 8B params of knowledge to reason through bugs.

Playing around with the context I fed it, it was able to fix up to ~34% of bugs vs ~46% for Qwen2.5-Coder-3B and ~54% for Qwen2.5-Coder-7B.

XCSme · 2026-05-30T00:33:10 1780101190

I will test it when it's accessible via OpenRouter, but the previous LFM2 model (lfm-2-24b-a2b) didn't do well on my tests, it got only 1/20 questions/tasks right, way below Gemma 31B or Qwen 35b-a3b (those get like 10/20 right)

BoorishBears · 2026-05-30T13:57:57 1780149477

I tested it against Gemma 4 31B and it's expectedly not favorable for world knowledge.

But even against E4B it's shaky, which is surprising given how many tokens they trained on. I guess it was on a lot of synthetic data.

mike_hearn · 2026-05-30T15:40:14 1780155614

It's not intended to be a coding model, however.

HanClinto · 2026-05-29T22:56:02 1780095362

Some of the coding-specific fine-tunes were really impressive boosts. Qwen2.5-3B-Instruct is also available [0] -- if it's not too much to ask, I'd be curious how more general models stack up in your benchmark?

[0] - https://huggingface.co/Qwen/Qwen2.5-3B-Instruct