More

benob · 2026-04-22T15:38:05 1776872285

I get ~5 tokens/s on an M4 with 32G of RAM, using:

  llama-server \
   -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
   --no-mmproj \
   --fit on \
   -np 1 \
   -c 65536 \
   --cache-ram 4096 -ctxcp 2 \
   --jinja \
   --temp 0.6 \
   --top-p 0.95 \
   --top-k 20 \
   --min-p 0.0 \
   --presence-penalty 0.0 \
   --repeat-penalty 1.0 \
   --reasoning on \
   --chat-template-kwargs '{"preserve_thinking": true}'

35B-A3B model is at ~25 t/s. For comparison, on an A100 (~RTX 3090 with more memory) they fare respectively at 41 t/s and 97 t/s.

I haven't tested the 27B model yet, but 35B-A3B often gets off rails after 15k-20k tokens of context. You can have it to do basic things reliably, but certainly not at the level of "frontier" models.

danielhanchen · 2026-04-22T15:49:29 1776872969

We also made some dynamic MLX ones if they help - it might be faster for Macs, but llama-server definitely is improving at a fast pace.

https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-4bit

DarmokJalad1701 · 2026-04-22T17:51:34 1776880294

What exactly does the .sh file install? How does it compare to running the same model in, say, omlx?

dunb · 2026-04-22T15:46:39 1776872799

Why use --fit on on an M4? My understanding was that given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results.

halJordan · 2026-04-23T23:38:01 1776987481

Meaningless question, fit will put everything on the gpu if it fits. Fa is default on. No-mmap is not an inference tradeoff and if you do turn it off you need to turn on direct io via -dio

What he should actually do is enable speculative decoding

fuomag9 · 2026-04-22T22:52:43 1776898363

I confirm with the GGUF version at q4, 35B-A3B starts going in thinking loops at 60k basically

kpw94 · 2026-04-22T16:48:51 1776876531

When you say tok/s here are you describing the prefill (prompt eval) token/s or the output generation tok/s?

(Btw I believe the "--jinja" flag is by default true since sometime late 2025, so not needed anymore)

benob · 2026-04-22T19:19:29 1776885569

Here is llama-bench on the same M4:

  | model                    |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | qwen35 27B Q4_K_M        |  15.65 GiB |    26.90 B | BLAS,MTL   |       4 |           pp512 |         61.31 ± 0.79 |
  | qwen35 27B Q4_K_M        |  15.65 GiB |    26.90 B | BLAS,MTL   |       4 |           tg128 |          5.52 ± 0.08 |
  | qwen35moe 35B.A3B Q3_K_M |  15.45 GiB |    34.66 B | BLAS,MTL   |       4 |           pp512 |        385.54 ± 2.70 |
  | qwen35moe 35B.A3B Q3_K_M |  15.45 GiB |    34.66 B | BLAS,MTL   |       4 |           tg128 |         26.75 ± 0.02 |

So ~60 for prefill and ~5 for output on 27B and about 5x on 35B-A3B.

zargon · 2026-04-22T17:24:19 1776878659

If someone doesn't specifically say prefill then they always mean decode speed. I have never seen an exception. Most people just ignore prefill.

kpw94 · 2026-04-22T17:40:01 1776879601

But isn't the prefill speed the bottleneck in some systems* ?

Sure it's order of magnitude faster (10x on Apple Metal?) but there's also order of magnitude more tokens to process, especially for tasks involving summarization of some sort.

But point taken that the parent numbers are probably decode

* Specifically, Mac metal, which is what parent numbers are about

zargon · 2026-04-22T18:10:51 1776881451

Yes, definitely it's the bottleneck for most use cases besides "chatting". It's the reason I have never bought a Mac for LLM purposes.

It's frustrating when trying to find benchmarks because almost everyone gives decode speed without mentioning prefill speed.

mercutio2 · 2026-04-22T23:37:06 1776901026

oMLX makes prefill effectively instantaneous on a Mac.

Storing an LRU KV Cache of all your conversations both in memory, and on (plenty fast enough) SSD, especially including the fixed agent context every conversation starts with, means we go from "painfully slow" to "faster than using Claude" most of the time. It's kind of shocking this much perf was lying on the ground waiting to be picked up.

Open models are still dumber than leading closed models, especially for editing existing code. But I use it as essentially free "analyze this code, look for problem <x|y|z>" which Claude is happy to do for an enormous amount of consumed tokens.

But speed is no longer a problem. It's pretty awesome over here in unified memory Mac land :)

cyanydeez · 2026-04-22T18:16:40 1776881800

Using opencode and Qwen-Coder-Next I get it reliably up to about 85k before it takes too long to respond.

I tried the other qwen models and the reasoning stuff seems to do more harm than good.

wuschel · 2026-04-22T17:30:20 1776879020

How is the quality of model answers to your queries? Are they stable over time?

I am wondering how to measure that anyway.

benob · 2026-04-22T06:50:22 1776840622

I miss the comment tagging system: insightful, informative, interesting, funny. It would make sense for hn.

i_think_so · 2026-04-22T07:50:40 1776844240

You forgot Troll, you insensitive clod!

GuB-42 · 2026-04-22T08:08:46 1776845326

"Score: 5, Troll" is the ultimate achievement.

To put it that into context, some tags count as upvotes, others count as downvotes, "Troll" is a downvote. So to have your post labelled as "Troll" with a positive score, it has to have enough upvotes to compensate the penalty from the "Troll" votes, but without having another tag dominate. 5 is the maximum score.

"Score: 5, Troll" is therefore the mark of a very successful troll.

AnssiH · 2026-04-22T09:23:07 1776849787

They also have "Underrated" and "Overrated" which apply points but do not act as tags. So I guess the easiest way to get +5 Troll is to have many Troll and Underrated votes, if it works the way I think it does.

benob · 2026-04-12T07:22:42 1775978562

Space station tracking: https://flight-viz.com/cockpit.html?lat=40.64&lon=-73.78&alt...

benob · 2026-04-12T07:09:18 1775977758

I just realized that a hash function is nothing less than the output of a deterministic random number generator xored with some data

adrian_b · 2026-04-12T09:14:15 1775985255

Hash functions and PRNGs are closely related, they share many properties and they can be built from the same algorithmic components, so for many kinds of PRNGs there are corresponding kinds of hash functions and vice-versa.

Nevertheless, the purposes of hash functions and PRNGs are different and complementary.

A PRNG receives a short fixed-length value (the seed) and it expands it into a long pseudo-random sequence of arbitrary length.

A hash function receives a long input sequence of arbitrary length and it generates a short fixed-length pseudo-random value.

Good PRNGs are injective functions and good hash functions are surjective functions.

Normally the design methods for PRNGs and for hash functions should be presented together, because it is easy to interconvert algorithms for one of them with algorithms for the other. For instance, given a good hash function one could make a PRNG by computing the hashes of a sequence of numbers or the hashes of a sequence of strings of increasing length, and given a good PRNG one could make a hash function by accumulating somehow the input into a PRNG seed and taking the first generated number, or better by using input chunks as seeds and then accumulating the first generated numbers into a single value.

However for a successful conversion between PRNG and hash function algorithms, the source algorithm may have have to be overdesigned, to guarantee good enough properties even after the conversion.

When an algorithm is designed directly as a hash function or as a PRNG, with clearly specified requirements, it can be designed only as good as strictly necessary, enabling thus a better performance.

derriz · 2026-04-12T08:32:17 1775982737

Could you explain what you mean here?

Hashes are _functions_ so provide the same output given the same input.

If you don't reseed the RNG after every hash computation, then you break this vital property of hashes.

And if you do reseed, then your claim boils down to "every hash function is just an XOR against a contstant" which certainly is not true either.

eru · 2026-04-12T07:54:51 1775980491

Sorry, what?

That might we one very particular way to write a hash function, but it's far from the only one.

Believe it or not, for some purposes taking the first few bytes of a string or even just the length of the string are good hash functions.

andai · 2026-04-12T08:30:22 1775982622

Well, that's technically also a deterministic random number generator! (I want to say it's not a great one, but... that's apparently context-dependent!)

What are those purposes?

eru · 2026-04-12T10:24:23 1775989463

If your input is i.i.d. random, then truncating works great. Eg if your keys are UUIDs then truncating can work well.

Another use:

Suppose you write a tool like rmlint that is looking for duplicate files. Generally, you compute some hash for each file, see if you got any duplicates, and then compare the relevant files directly.

A traditional hash like crc or sha256 takes O(n) to compute. But for files you can start with some cheaper hashes, like file length. After all, files of different length can't have the same content. Taking the first few bytes of your file is another cheap 'hash' you can compute.

Only when these cheap 'hashes' show that you have a potential duplicate, do you go and pay for a more expensive hash.

benob · 2026-04-12T05:22:50 1775971370

No, the failure is the human written prompt

not_that_d · 2026-04-12T08:23:29 1775982209

You know, after a while this excuse is not valid anymore.

roywiggins · 2026-04-12T21:24:29 1776029069

If they're that hard to prompt maybe it's easier just to write the blog posts yourself.

benob · 2026-04-05T06:42:32 1775371352

The author emphasizes accessibility and coherence as a benefit but another interesting one is composability which does not emerge naturally in the world of UI. Create a UI for a pair of websites like a command line for grep and wc. LLMs already provide that but under the natural language interaction primitive. UI could allow for branded experiences, ad delivery and whatnot in ways that natural language doesn't.

benob · 2026-04-01T16:48:34 1775062114

"That allows us to license the open source project under the more permissive MIT license."

benob · 2026-03-31T06:03:33 1774937013

I would say:

- decomposition: discover a more general form of Fourrier transform to untangle the underlying factors

- memorization: some patterns are recurrent in many domains such as power low

- multitask: exploit cross-domain connections such as weather vs electricity

benob · 2026-03-31T05:48:27 1774936107

Ollama is a user-friendly UI for LLM inference. It is powered by llama.cpp (or a fork of it) which is more power-user oriented and requires command-line wrangling. GGML is the math library behind llama.cpp and GGUF is the associated file format used for storing LLM weights.

redmalang · 2026-03-31T07:04:47 1774940687

i've found llama.cpp (as i understand it, ollama now uses their own version of this) to work much better in practice, faster and much more flexible.

benob · 2026-03-25T07:02:58 1774422178

This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.

spencerflem · 2026-03-25T07:04:42 1774422282

I think it is though-

“ TurboQuant, QJL, and PolarQuant are more than just practical engineering solutions; they’re fundamental algorithmic contributions backed by strong theoretical proofs. These methods don't just work well in real-world applications; they are provably efficient and operate near theoretical lower bounds.”

integralid · 2026-03-25T08:08:56 1774426136

I also instinctively reacted to that fragment, but at this point I think this is overreacting to a single expression. It's not just a normal thing to say in English, it's something people have been saying for a long time before LLMs existed.

nvme0n1p1 · 2026-03-25T08:31:23 1774427483

There are tells all over the page:

> Redefining AI efficiency with extreme compression

"Redefine" is a favorite word of AI. Honestly no need to read further.

> the key-value cache, a high-speed "digital cheat sheet" that stores frequently used information under simple labels

No competent engineer would describe a cache as a "cheat sheet". Cheat sheets are static, but caches dynamically update during execution. Students don't rewrite their cheat sheets during the test, do they? LLMs love their inaccurate metaphors.

> QJL: The zero-overhead, 1-bit trick

> It reduces each resulting vector number to a single sign bit (+1 or -1). This algorithm essentially creates a high-speed shorthand that requires zero memory overhead.

Why does it keep emphasizing zero overhead? Why is storing a single bit a "trick?" Either there's currently an epidemic of algorithms that use more than one bit to store a bit, or the AI is shoving in extra plausible-sounding words to pad things out. You decide which is more likely.

It's 1:30am and I can't sleep, and I still regret wasting my time on this slop.

TeMPOraL · 2026-03-25T14:18:12 1774448292

I say you're fixating on the wrong signal here. "Redefine" and "cheat sheet" are normal words people frequently use, and I see worse metaphors in human-written text routinely.

It's the structure and rhythm at the sentence and paragraph levels that's the current tell, as SOTA LLMs all seem to overuse clarification constructs like "it's not X, it's Y" and "it's X, an Y and a Z", and "it's X, it's essentially doing Y".

Thing is, I actually struggle to find what's so off-putting about these, given that they're usually used correctly. So far, the best hypothesis I have for what makes AI text stand out is that LLM output is too good. Most text written by real humans (including my own) is shit, with the best of us caring about communicating clearly, and most people not even that; nobody spends time refining the style and rhythm, unless they're writing a poem. You don't expect a blog post or a random Internet article (much less a HN comment) to be written in the same style as a NYT bestseller book for general audience - but LLMs do that naturally, they write text better at paragraph level than most people ever could, which stands out as jarring.

> Either there's currently an epidemic of algorithms that use more than one bit to store a bit, or the AI is shoving in extra plausible-sounding words to pad things out. You decide which is more likely.

Or, those things matter to authors and possibly the audience. Which is reasonable, because LLMs made the world suddenly hit hard against global capacity constraints in compute, memory, and power; between that and edge devices/local use, everyone who pays attention is interested in LLM efficiency.

snovv_crash · 2026-03-25T15:57:01 1774454221

LLM prose is very bland and smooth, in the same way that bland white factory bread is bland and smooth. It also typically uses a lot of words to convey very simple ideas, simply because the data is typically based on a small prompt that it tries to decompress. LLMs are capable of very good data transformation and good writing, but not when they are asked to write an article based on a single sentence.

TeMPOraL · 2026-03-25T16:29:33 1774456173

That's true. I.e. it's not that they're not capable of doing better, it's just whoever's prompting them is typically too lazy to add an extra sentence or three (or a link) to steer it to a different region of the latent space. There's easily a couple dozen dimensions almost always left at their default values; it doesn't take much to alter them and nudge the model to sample from a more interesting subspace style-wise.

(Still, it makes sense to do it as a post-processing style transfer space, as verbosity is a feature while the model is still processing the "main" request - each token produced is a unit of computation; the more terse the answer, the dumber it gets (these days it's somewhat mitigated by "thinking" and agentic loops)).

ptx · 2026-03-30T13:40:17 1774878017

> they write text better

Not if you view text as a medium for communication, i.e. as a way for a sender to serialize some idea they have in their mind and transfer it to the reader for deserialization.

The AI doesn't know what the sender meant. It can't add any clarity. It can only corrupt and distort whatever message the sender was trying to communicate.

Fixating on these tells is a way for the receiver of the message to detect that it has been corrupted and there is no point in trying to deserialize it. The harder you try to interpret an AI-generated message, the less sense it will make.

spencerflem · 2026-03-25T17:22:15 1774459335

Because it’s a lot of fluff to convey things in a way that’s not very accurate.

veunes · 2026-03-25T09:25:58 1774430758

Looks like Google canned all their tech writers just to pivot the budget into H100s for training these very same writers

snovv_crash · 2026-03-25T10:55:04 1774436104

Capex vs. opex

radarsat1 · 2026-03-25T17:27:29 1774459649

> "Redefine" is a favorite word of AI. Honestly no need to read further.

You're not wrong, but it certainly is an annoying outcome of AI that we're not allowed to use.. words.. anymore.

roywiggins · 2026-03-25T13:42:57 1774446177

"The X Trick" or "The Y Dilemma" or similar snowclones in a header is also a big AI thing. Humans use this construction too, but LLMs love it out of all proportion. I call it The Ludlum Delusion (since that's how every Robert Ludlum book is titled).

pqs · 2026-03-25T08:43:35 1774428215

There is also the possibility that the article when through the hands of the company's communication department which has writers that probably write at LLM level.

g-mork · 2026-03-25T10:19:57 1774433997

Another instinctual reaction here. This specific formulation pops out of AI all the time, there might as well have been an emdash in the title

NoahZuniga · 2026-03-25T10:26:09 1774434369

Genius new idea: replace the em-dashes with semicolons so it looks less like AI.

tux3 · 2026-03-25T11:29:31 1774438171

You're absolutely right. That's not just a genius idea; it's a radical new paradigm.

Quarrel · 2026-03-25T14:34:50 1774449290

Damnit.

There goes another bit of my writing style that will get mistaken for an LLM.

zarzavat · 2026-03-25T10:30:50 1774434650

I read "this clever step" and immediately came to the comments to see if anyone picked up on it.

It reads like a pop science article while at the same time being way too technical to be a pop science article.

Turing test ain't dead yet.

TeMPOraL · 2026-03-25T14:02:15 1774447335

> Turing test ain't dead yet.

Only because people are lazy, and don't bother with a simple post-processing step: attach a bunch of documents or text snippets written by a human (whether yourself or, say, some respected but stylistically boring author), and ask the LLM to match style/tone.

benob · 2026-03-25T07:09:43 1774422583

Maybe they quantized a bit too much the model parameters...

BenoitP · 2026-03-25T08:02:07 1774425727

It is AI generated. Or was written by someone a bit far from the technical advances IMHO. The Johnson-Lindenstrauss Lemma is a very specific and powerful concept, when in the article the QLJ explanation is vacuous. A knowledgeable human would not have left the reader wanting for how that relates to the Lemma.

davesque · 2026-03-25T17:28:03 1774459683

Yeah, and some parts of the article are just bizarre:

> Instead of looking at a memory vector using standard coordinates (i.e., X, Y, Z) that indicate the distance along each axis, PolarQuant converts the vector into polar coordinates using a Cartesian coordinate system. This is comparable to replacing "Go 3 blocks East, 4 blocks North" with "Go 5 blocks total at a 37-degree angle”

Why bother explaining this? Were they targeting the high school and middle school student reader base??

jeeeb · 2026-03-27T08:15:45 1774599345

It feels very much like Gemini’s writing style - overly excited with lots of unnecessary contrasts.