Hacker News | trvz's comments

Most knowledge workers aren't willing to put in the effort needed to get their work done efficiently.

If you have to ask then your GPU is too small.

With 16 GB you'll only be able to run a very compressed variant with noticeable quality loss.


Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at speeds that are decent. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?


It's worth noting that there are now machines other than Apple's that combine a powerful SoC with a large pool of unified memory for local AI use:

> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...

> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...

> https://frame.work/products/desktop-diy-amd-aimax300/configu...

etc.

But yes, a modern SoC-style system with a large unified memory pool is still one of the best ways to do it.


My Mac Studio with 96GB of RAM is maybe just at the low end of passable. It's actually extremely good for local image generation. I could somewhat replace something like Nano Banana comfortably on my machine.

But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.

For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.


You just need to adjust your workflow to use the smaller models for coding. It's primarily just a case of holding them wrong if you end up with worse outputs.

32 GiB of VRAM is possible to acquire for less than $1000 if you go for the Arc Pro B70. I have two of them. The tokens/sec is nowhere near AMD or NVIDIA high end, but it's unexpectedly kind of decent to use. (I probably need to figure out vLLM though, as it doesn't seem like llama.cpp is able to do them justice, seemingly even with split mode = row. But still, 30t/s on Gemma 4 (on 26B MoE, not dense) is pretty usable, and you can fit a full 256k context.)

When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)


New versions of llama.cpp have experimental split-tensor parallelism, but it really only helps with slow compute and a very fast interconnect, which doesn't describe many consumer-grade systems. For most users, pipeline parallelism will be their best bet for making use of multi-GPU setups.

Yeah, I was doing split tensor and it seemed like a wash. The Arc B70s are not huge on compute.

Right now I'm only able to run them in PCI-e 5.0 x8 which might not be sufficient. But, a cheap older Xeon or TR seems silly since PCI-e 4.0 x16 isn't theoretically more bandwidth than PCI-e 5.0 x8. So it seems like if that is really still bottlenecked, I'll just have to bite the bullet and set up a modern HEDT build. With RAM prices... I am not sure there is a world where it could ever be worth it. At that point, seems like you may as well go for an obscenely priced NVIDIA or AMD datacenter card instead and retrofit it with consumer friendly thermal solutions. So... I'm definitely a bit conflicted.
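The bandwidth comparison above can be checked with a little arithmetic. This is a rough sketch that assumes the published per-lane signalling rates (16 GT/s for PCIe 4.0, 32 GT/s for PCIe 5.0) and 128b/130b encoding, and ignores protocol overhead, so real throughput will be lower. The function name is made up for illustration.

```python
# Per-lane raw signalling rates in GT/s (published PCIe figures).
GT_PER_S = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}
# PCIe 3.0 and later use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gb_s(generation: str, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    bits_per_second = GT_PER_S[generation] * 1e9 * ENCODING_EFFICIENCY * lanes
    return bits_per_second / 8 / 1e9


gen4_x16 = pcie_bandwidth_gb_s("4.0", 16)
gen5_x8 = pcie_bandwidth_gb_s("5.0", 8)
print(f"PCIe 4.0 x16: {gen4_x16:.1f} GB/s")
print(f"PCIe 5.0 x8:  {gen5_x8:.1f} GB/s")
```

Both come out at roughly 31.5 GB/s per direction, so on paper moving to an older x16 platform would not buy any extra link bandwidth.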

I do like the Arc Pro B70 so far. It's not a performance monster, but it's quiet and relatively low power, and I haven't run into any instability. (The AMDGPU drivers have made amazing strides, but... The stability is not legendary. :)

I'll have to do a bit of analysis and make sure there really is an interconnect bottleneck first, versus a PEBKAC. Could be dropping more lanes than expected for one reason or another too.


You could fit your HEDT with minimum RAM and a combination of Optane storage (for swapping system RAM with minimum wear) and fast NAND (for offloading large read-only data). If you have abundant physical PCIe slots it ought to be feasible.

NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s.

Arc Pro B70 seems unexpectedly slow? Or are you using 8-bit/16-bit quants?


Unfortunately it really is running this slow with Llama.cpp, but of course that's with Vulkan mode. The VRAM capacity is definitely where it shines, rather than compute power. I am pretty sure that this isn't really optimal use of the cards, especially since I believe we should be able to get decent, if still sublinear, scaling with multiple cards. I am not really a machine learning expert, but I'm curious to see if I can manage to trace down some performance issues. (I've already seen a couple of issues get squashed since I first started testing this.)

I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.

A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.


A bit like asking how long is a piece of string.

It's twice as long as from one end to the middle.

More like "about how long of a string do I need to run between two houses in the densest residential neighborhood of single-family homes in the US?"

It’s also doable with AMD Strix Halo.

Macs with unified memory are economical in terms of $/GB of video memory, and they match an optimized/home built GPU setup in efficiency (W/token), but they are slow in terms of absolute performance.

With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.

To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap Mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).


Obviously going to depend on your definition of "decent". My impression so far is that you will need between 90GB and 100GB of memory to run medium-sized (31B dense or ~110B MoE) models with some quantization enabled.

I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.
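For a back-of-the-envelope check of why that fits, here is a sketch that estimates the size of the weights alone. It assumes exactly 31 billion parameters, all stored as GGUF Q8_0 (32 int8 weights plus one fp16 scale per block, so 8.5 bits per weight), and it ignores the KV cache and runtime buffers, which take additional memory. The helper name is made up for illustration.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# Q8_0: 34 bytes per block of 32 weights = 8.5 bits per weight.
Q8_0_BITS = 8.5

size_gb = weight_size_gb(31, Q8_0_BITS)  # assumption: exactly 31e9 parameters
vram_gb = 48                             # two 24GB cards, as in the comment above
print(f"weights: {size_gb:.1f} GB, left for context and buffers: {vram_gb - size_gb:.1f} GB")
```

Under these assumptions the weights take about 33 GB, leaving roughly 15 GB across the two cards for context.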

I have the same setup but tried paperclip ai with it, and it seems to me that either I'm unable to set it up properly or multiple agents struggle with this setup, especially as paperclip ai and opencode (used for the connection) seem to be blowing up the context to 20-30k.

Any tips around your setup running this?

I use lmstudio with default settings and prioritization instead of split.


I asked AI for help setting it up. I use 128k context for 31B and 256k context for 26B-A4B. Ollama worked out of the box for me but I wanted more control with llama.cpp.

My command for llama-server:

llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -ngl 99 -sm layer -ts 10,12 --jinja --flash-attn on --cont-batching -np 1 -c 262144 -b 4096 -ub 512 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 8080 --timeout 18000


No, GP is excessively restrictive. Llama.cpp supports RAM offloading out of the box.

It's going to be slower than if you put everything on your GPU but it would work.

And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.


I'm running Q3 XXS, with both full and quantized context as options, on a 16GB GPU; it still has pretty decent quality and fits fine with up to 64k context.

Aren't 4-bit models decent? Since this is an MoE model, I'm assuming it should have respectable tk/s, similar to previous MoE models.

Meanwhile, Backblaze still happily backs up the 100TB+ I have on various hard drives with my Mac Pro.

Does it? How do you know?

If they start excluding random content (eg: .git) without effective notice, maybe they AREN'T backing up everything you think they are.


You don’t do quarterly restore tests?

How do you do that?

My naive idea: Download 100 TB every 3 months to a 2nd device, create a list of files restored, validate checksums with the original machine, make a list of files differing and missing, check which ones are supposed to be missing? That sounds like a full time job.
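The comparison step of that idea can be sketched in a few lines. This is only an illustration, assuming both the original and the restored tree are reachable as local directories; the function names are made up, and it says nothing about how the restore itself is downloaded.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 hex digest."""
    digests = {}
    # rglob("*") also walks dot-directories such as .git
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in 1 MiB chunks so large files are never held in memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def compare_restore(original: Path, restored: Path) -> tuple[list[str], list[str]]:
    """Return (missing, differing) relative paths, judged against the original tree."""
    orig = file_digests(original)
    rest = file_digests(restored)
    missing = sorted(p for p in orig if p not in rest)
    differing = sorted(p for p in orig if p in rest and orig[p] != rest[p])
    return missing, differing
```

Calling `compare_restore(Path("/data"), Path("/mnt/restore"))` (both paths hypothetical) returns the two lists to review. Running it on a restored sample of folders rather than the full 100 TB would keep the job manageable, at the cost of weaker coverage.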


Nowadays it's: hi Claude, write a script in the language I hate the least which will ...

I don't really care about the content, but European software is also when you switch to the tab the energy consumption of your MacBook quadruples due to some inane animations.

You may not be able to justify violence, but sometimes you can understand it.

They always have the option to stop accepting new customers when their infrastructure is peaked out instead of lowering quality for everyone.

You don't know whether this is due to infrastructure capacity or has other reasons (organizational). Also, "let's stop accepting new customers" is probably not a realistic choice for a hundred reasons.

That would mean in a way accepting that they are suddenly a service company with the aim to create revenue by selling services to customers for money.

You can't stop accepting new customers unless you're fine with killing your potential future customer base. That's a ridiculous suggestion.

Either your current customers or your potential future customers are going to be unhappy so long as compute resources are finite. Take your pick.


> That's a ridiculous suggestion.

Is it though? Claude's reliability is now at an all-time low of 98.7%. It's not a stretch to think that large companies will have second thoughts about adopting Claude for their production environments.


Waiting lists are a thing.

> You can't stop accepting new customers unless you're fine with killing your potential future customer base. That's a ridiculous suggestion.

what? they already have, they aren't releasing mythos except to a limited pre-approved customer base who is practically begging them to take their money. they can do that for lower tier models and at this point they should.


Their rationale provided for that is safety-based, not infrastructure-based.

>You can't stop accepting new customers unless you're fine with killing your potential future customer base. That's a ridiculous suggestion.

And yet, it's what any business with limited stock or slots (from restaurants and car companies to airlines) has done since forever...


Calling your own article all those things is a major turn-off.

Piece of free advice towards a better civilisation: people who didn't even read the comment they're replying to shouldn't be rewarded for their laziness.

I read his comment and still replied. I think his claim that nobody reads thinking blocks and that thinking blocks increase latency is nonsense. I am not going to figure out which settings I need to enable because after reading this thread I cancelled my subscription and switched over to Codex. Because I had the exact same experience as many in this thread.

Also what is that "PR advice"—he might as well wear a suit. This is absolutely a nerd fight.


Alright, I just tested that setting and it doesn't work.

https://i.imgur.com/MYsDSOV.png

I tested because I was porting memories from Claude Code to Codex, so I might as well test. I obviously still have subscription days remaining.

There is another comment in this thread linking a GitHub issue that discusses this. The GitHub issue this whole HN submission is about even says that Anthropic hides thinking blocks.


How are you porting over your memories, skills, commands (codex doesn't have commands)?

I didn't use commands. I only used rules, memories, and skills. I asked Codex to read rules and memories from where Claude Code stores them on the filesystem and merge them into `AGENTS.md`. This actually works better: Anthropic prompts Claude Code to write each memory to a separate file, so you end up with a main MEMORY.md that acts as a kind of directory, listing each individual memory with its file name and a brief description, in the hope that Claude Code will read them. The problem is that Claude Code never does. I believe this is the same problem[0] that Vercel had with skills. Skills are easy to port because they appear to use the same format, so you can just do `mv ~/.claude/skills ~/.codex/skills` (or `.agents/skills`).

[0]: https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
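For anyone who would rather do the merge by hand than ask an agent, a minimal sketch of the idea follows. It is not what Codex actually did; the directory layout, the function name, and the choice to skip the MEMORY.md index file are all assumptions made for illustration, and the real memory location has to be looked up on your own machine.

```python
from pathlib import Path


def merge_memories(memory_dir: Path, agents_file: Path,
                   skip: frozenset[str] = frozenset({"MEMORY.md"})) -> int:
    """Append every *.md file in memory_dir to agents_file; return how many were merged."""
    sections = []
    for path in sorted(memory_dir.glob("*.md")):
        if path.name in skip:
            continue  # assumption: the index file only lists the other memories
        body = path.read_text(encoding="utf-8").strip()
        if body:
            # One section per memory, titled after the source file name
            sections.append(f"## {path.stem}\n\n{body}\n")
    if sections:
        with agents_file.open("a", encoding="utf-8") as out:
            out.write("\n" + "\n".join(sections))
    return len(sections)
```

Putting everything into one file means the agent sees the memories directly instead of having to decide to open each one.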


What I was pointing out in my comment about the PR advice is that someone responding from a corporation to customers should be providing information to help the customer, nothing more.

Customers may want to fight - you seem to be providing an example - but representatives shouldn't take the bait.


Ok, so that’s good for Apple.

Microsoft is invested heavily in OpenAI?


It absolutely is RAM…

So much so that this was what made Apple increase their base sizes.

