This is the reason, when I built a tool in the same space, I chose to benchmark ...

esafak · 2026-06-05T21:31:39 1780695099

Did you benchmark the competition and can we see?

jahala · 2026-06-06T08:14:59 1780733699

No, I don't have the funds to benchmark the competition, but I would be happy to put the numbers up if any token whales feel like having a go.

https://github.com/jahala/tilth/tree/main/benchmark

alex7o · 2026-06-06T11:44:18 1780746258

Oh, that is a nice approach. I wish more benchmarks reported cost per successful answer.

onlyrealcuzzo · 2026-06-06T00:00:05 1780704005

The problem with even attempting to develop a tool for the frontier model space is that the cost to run a statistically significant benchmark is almost certainly going to be over $100 for a single model.
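To make the cost argument concrete, here is a rough sketch of how such an estimate could be worked out. It uses the standard normal-approximation sample size for comparing two success rates. The success rates, the price per run, and the function names are all hypothetical placeholders, not figures from any real benchmark.

```python
import math
from statistics import NormalDist


def runs_per_arm(p_baseline, p_tool, alpha=0.05, power=0.80):
    """Approximate runs needed per arm to detect a difference between two
    success rates (two-sided test, normal approximation)."""
    effect = p_tool - p_baseline
    if effect == 0:
        raise ValueError("success rates must differ to estimate a sample size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Sum of the per-arm Bernoulli variances.
    variance = p_baseline * (1 - p_baseline) + p_tool * (1 - p_tool)
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


def benchmark_cost_usd(runs, cost_per_run_usd, arms=2):
    """Total spend for running every arm `runs` times."""
    return runs * arms * cost_per_run_usd


# Hypothetical inputs: 70% success without the tool, 80% with it,
# and an assumed $0.25 per agent run.
n = runs_per_arm(0.70, 0.80)
print(n, benchmark_cost_usd(n, 0.25))
```

Smaller expected improvements push the run count up quickly, since the count scales with the inverse square of the difference in success rates.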

Unless something is like 25%+ more cost effective on Gemini for a task, I would not assume those savings are going to transfer to GPT.

If you need to run a test this expensive and slow for every release, hobbyists aren't going to do it.

And if you wanted to show broad improvements to coding, like they all claim, the costs would be in the thousands per release, even for a single model.

And the results almost certainly would not be eye-popping.

If the models could be SUBSTANTIALLY better, Google and Anthropic and OpenAI wouldn't be finding that out from a hobbyist making wildly unscientific claims.

jahala · 2026-06-06T08:18:28 1780733908

Yup, this is hitting it on the nose. But despite the cost, the benchmark is the vital ingredient that can't be skipped. Otherwise, you don't know if what you're building is actually helping the agent rather than hindering it.

On the previous large benchmark run, I proved a 40-50% cost reduction per correct answer.
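For anyone unfamiliar with the metric, here is a minimal sketch of how cost per correct answer and the relative reduction could be computed. The function names and the dollar figures are made up for illustration and are not taken from the actual benchmark.

```python
def cost_per_correct(total_cost_usd, correct_answers):
    """Total spend divided by the number of correct answers.
    Failed runs still cost money, so they raise this figure."""
    if correct_answers <= 0:
        raise ValueError("need at least one correct answer")
    return total_cost_usd / correct_answers


def relative_reduction(baseline, candidate):
    """Fractional drop from baseline to candidate (0.4 means 40% cheaper)."""
    return 1 - candidate / baseline


# Hypothetical numbers: both setups get 40 answers right,
# but the run with the tool spends $6.00 instead of $10.00.
baseline = cost_per_correct(10.00, 40)
with_tool = cost_per_correct(6.00, 40)
print(f"{relative_reduction(baseline, with_tool):.0%} cheaper per correct answer")
```

Dividing by correct answers rather than by total runs means a tool that saves tokens but causes more failures does not come out ahead.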

I'm not sure why the vendors aren't using token filtering/compression more in their tooling, but perhaps they don't mind users feeding them more data and using more tokens.