
Yeah. LLM inference needs batch processing to have a shred of hope of being cost-efficient. Batching requires a non-trivial amount of scale (though probably not as much as people think).

I'm very pro local models, but not as a way to reach parity with SoTA frontier models. Just small models trained for a specific context, doing smaller, specific tasks.

Trying to run bigger LLMs for an individual user to do big tasks is not going to be a good time.
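To put rough numbers on the batching point, here is a toy sketch. It assumes (my simplification, not a measurement) that each decode step has a fixed cost for reading the weights plus a per-sequence compute cost, and that the step takes whichever is longer. The function name and both timing constants are made up for illustration.

```python
def tokens_per_second(batch_size, weight_read_s, compute_per_seq_s):
    """Toy throughput model for one decode step.

    Assumption: a step costs max(fixed weight-read time,
    batch_size * per-sequence compute time) and yields one token
    per sequence in the batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    step_time = max(weight_read_s, batch_size * compute_per_seq_s)
    return batch_size / step_time


# Hypothetical timings, chosen only to show the shape of the curve.
WEIGHT_READ_S = 0.02       # time to stream the weights once per step
COMPUTE_PER_SEQ_S = 0.0005  # extra compute per sequence in the batch

for batch in (1, 8, 32, 64, 128):
    tps = tokens_per_second(batch, WEIGHT_READ_S, COMPUTE_PER_SEQ_S)
    print(f"batch={batch:4d}  tokens/s={tps:8.1f}")
```

Under these made-up numbers, throughput grows roughly linearly with batch size while the weight read dominates, then flattens once compute takes over. That is the "some scale, but not unlimited scale" shape: a single user sits at the far left of the curve.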



I think people overstate the importance of batch processing and understate the importance of consistent workloads.

A GPU is much like a hotel or an airplane. To make hotel and airplane economics work out, you need to keep your seats / rooms filled at all times. Because the number of spots is limited, an empty seat, room or request slot today is an opportunity forever wasted.

Unlike hotels and airplanes, though, GPU requests are short-lived. This means you need somebody to provide you with useful work to do 24/7. If you're an individual who needs to sleep or a single-country business that works 9 to 5, most of your request processing opportunities will go to waste. You're much better off renting your GPU from somebody who can keep it busy with useful work 24/7 and can therefore offer much better per-request prices.
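The utilization argument is just arithmetic, so here is a minimal sketch of it. The hourly cost and capacity are hypothetical placeholders, and the model assumes the GPU costs the same whether or not it is serving requests.

```python
def cost_per_request(hourly_gpu_cost, requests_per_hour_at_full_load, utilization):
    """Effective cost of one served request.

    Assumption: the GPU costs the same every hour whether busy or idle,
    so idle capacity is paid for by the requests that do get served.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    served_per_hour = requests_per_hour_at_full_load * utilization
    return hourly_gpu_cost / served_per_hour


# Hypothetical numbers, for illustration only.
HOURLY_COST = 2.0   # currency units per GPU-hour
CAPACITY = 1000     # requests per hour if every slot is filled

office_hours = cost_per_request(HOURLY_COST, CAPACITY, 8 / 24)  # busy 9 to 5
always_busy = cost_per_request(HOURLY_COST, CAPACITY, 1.0)      # busy 24/7

print(f"9-to-5 workload: {office_hours:.4f} per request")
print(f"24/7 workload:   {always_busy:.4f} per request")
print(f"ratio:           {office_hours / always_busy:.1f}x")
```

With a workload that only fills the GPU 8 hours out of 24, each request ends up carrying three times the hardware cost of one served on a GPU that is always busy, regardless of the specific prices plugged in.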


Wasn't this pretty evident to pretty much anyone who knew even a bit about inference?

Idk what people were thinking. I've never seen anyone offer a plausible way to sidestep batch processing, for example.



