
Does anyone here have experience running large models (something like DeepSeek 4 Flash, MiniMax 2.7, etc.) on a multi-GPU setup with several RTX 6000s, under high concurrency and with large context lengths?

For what it's worth, I've been seeing ~100 tps with 4-bit MiniMax 2.7 on two RTX 6000 boards, just running under llama-server without any optimization effort at all. I have no serious long-context experience with that setup, but at 30K context it's still above 90 tps.

If you are happy with Qwen 3.6 27B, I would personally switch the 5090 out for 2x RTX 6000s and keep running 27B. That will give you ~2x your current throughput with a lot more headroom for multiple users. More important, it would buy time to see how things develop over the next few months before you spend a whole lot of money.



With that amount of memory, can you run 4-bit DeepSeek 4 Flash? It is way more efficient in the KV cache department, so it may be worth a try.
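
To put rough numbers on the KV cache point, here is a back-of-the-envelope sketch for standard (grouped-query) attention, where each layer stores one K and one V tensor per token. The layer count, head count, and head size below are made-up placeholders, not the real config of any model mentioned in this thread; plug in the values from the model's own config. Architectures that compress or share the cache will come in lower than this formula suggests.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Approximate KV cache size in bytes for ONE sequence.

    Assumes plain multi-head / grouped-query attention with an
    uncompressed cache; bytes_per_elem=2 corresponds to fp16/bf16.
    """
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical model shape, for illustration only.
per_seq = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128, n_tokens=30_000)

# With several concurrent users, each active sequence needs its own cache.
users = 8
total = per_seq * users

print(f"{per_seq / 2**30:.2f} GiB per sequence")
print(f"{total / 2**30:.2f} GiB for {users} concurrent sequences")
```

Whatever is left after the weights has to hold that total, which is why cache efficiency matters more as concurrency and context length go up.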


I haven't looked into DS4 yet, but based on antirez's results on 128 GB MacBooks, it shouldn't be a problem to run it on a pair of RTX 6000 Pros.

Also see https://www.reddit.com/r/LocalLLaMA/comments/1sv649s/to_run_... .



