
This is the reason, when I built a tool in the same space, I chose to benchmark with cost per correct answer.

Reducing tokens and turns is worthless if the LLM doesn't solve the task you gave it.
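To make the metric concrete, here is a minimal sketch of cost per correct answer. The `Run` record, the function name, and the dollar figures are all made up for illustration; they are not taken from any particular tool's benchmark.

```python
from dataclasses import dataclass


@dataclass
class Run:
    cost_usd: float  # total spend for one attempt (all tokens, all turns)
    correct: bool    # whether the final answer passed the check


def cost_per_correct_answer(runs):
    """Total spend divided by the number of correct answers."""
    total_cost = sum(r.cost_usd for r in runs)
    n_correct = sum(1 for r in runs if r.correct)
    if n_correct == 0:
        return float("inf")  # money spent, nothing solved
    return total_cost / n_correct


# Hypothetical numbers: the second setup halves the cost per run
# but solves far fewer tasks.
baseline = [Run(0.10, True), Run(0.10, True), Run(0.10, False), Run(0.10, True)]
cheaper_but_worse = [Run(0.05, True), Run(0.05, False), Run(0.05, False), Run(0.05, False)]

print(cost_per_correct_answer(baseline))
print(cost_per_correct_answer(cheaper_but_worse))
```

With these invented numbers the "cheaper" setup spends half as much per run yet ends up more expensive per correct answer, which is the failure mode a raw token count hides.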




Did you benchmark the competition and can we see?

No, I don't have the funds to benchmark the competition, but I would be happy to put the numbers up if any token whales feel like having a go.

https://github.com/jahala/tilth/tree/main/benchmark


Oh, that is a nice approach. I wish more benchmarks reported cost per successful answer.

The problem with even attempting to develop a tool for the frontier model space is that the cost to run a statistically significant benchmark is almost certainly going to be over $100 for a single model.
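As a rough way to see where a figure like that comes from, here is a back-of-the-envelope sketch. It assumes the comparison is a plain two-sided two-proportion test on success rate, with the usual normal-approximation sample size formula; the success rates and per-run price are hypothetical inputs, not measurements.

```python
import math
from statistics import NormalDist


def runs_per_arm(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate runs needed per arm to detect p_base vs p_new
    (normal approximation, two-sided two-proportion test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2)


def benchmark_cost_usd(p_base, p_new, cost_per_run_usd):
    """Both arms (with and without the tool) need the same number of runs."""
    return 2 * runs_per_arm(p_base, p_new) * cost_per_run_usd


# Hypothetical: 70% vs 80% success rate, at an assumed $0.20 per run.
print(runs_per_arm(0.70, 0.80))
print(benchmark_cost_usd(0.70, 0.80, 0.20))
```

Smaller differences blow this up quickly, since the required number of runs grows with the inverse square of the gap between the two success rates.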

Unless something is like 25%+ more cost effective on Gemini for a task, I would not assume those savings are going to transfer to GPT.

If you need to run a test this expensive and slow for every release, hobbyists aren't going to do it.

And if you wanted to demonstrate the kind of broad coding improvements they all claim, the costs would be in the thousands per release, even for a single model.

And they almost certainly would not be eye popping.

If the models could be SUBSTANTIALLY better, Google and Anthropic and OpenAI wouldn't be finding that out from a hobbyist making wildly unscientific claims.


Yup, this is hitting it on the nose. But despite the cost, the benchmark is the vital ingredient that can't be skipped. Otherwise, you don't know if what you're building is actually helping the agent rather than hindering it.

On the previous large benchmark run, I proved a 40-50% cost reduction per correct answer.

I'm not sure why the vendors aren't using token filtering/compression more in their tooling, but perhaps they don't mind users feeding them more data and using more tokens.



