This sounds super interesting and relevant. I run a small cluster with H100s (often for research projects running vLLM), and being able to see not just usage but efficiency would be great.
I don't fully get the 100% utilisation vs. 1-10% real compute distinction. Given that you rely on telemetry from users to add new models, are you trying to predict how fast a model should run on vLLM and compare that to how it runs in practice? And what if users tweak some hyperparameters?
What you described is the goal of Attainable SOL, but using GPU utilization as the metric rather than throughput. We're answering "for a given model and workload, have you optimized this well enough?", where "optimized" includes hyperparameter tuning. So if someone hasn't tuned batch size, parallelism, or other knobs well for their workload, the gap between their current utilization and the Attainable SOL is what tells them there's still room to improve.
We're motivated by the fact that reaching 100% Compute SOL is impossible -- no model can run at the hardware's theoretical maximum -- but we want to provide a realistic target for optimization. And we've noticed that different model architectures have different realistic ceilings. For example, MoE models run at much worse utilization due to their sparsity. We don't expect you to retrain an MoE model in order to get a higher utilization, and no hyperparameter tuning can bring you close to 100%, so the maximum attainable SOL should be lower for that model.
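For concreteness, here's a minimal sketch of the idea in Python. The names, the ceiling values, and the peak-FLOP/s figure are illustrative assumptions on my part, not the actual implementation: Compute SOL is taken as achieved FLOP/s over the hardware's theoretical peak, and the headroom against an architecture-specific Attainable SOL ceiling is what tells you how much optimization is realistically left.

    # Hypothetical sketch -- names and numbers are illustrative, not the real product API.
    PEAK_FLOPS = 989e12  # e.g. H100 SXM dense BF16 peak, roughly 989 TFLOP/s

    def compute_sol(achieved_flops_per_s: float, peak_flops: float = PEAK_FLOPS) -> float:
        """Fraction of the hardware's theoretical maximum actually being used."""
        return achieved_flops_per_s / peak_flops

    def optimization_headroom(current_sol: float, attainable_sol: float) -> float:
        """Gap between current utilization and the realistic, architecture-specific ceiling."""
        return max(attainable_sol - current_sol, 0.0)

    # Example: suppose a dense model can realistically reach ~60% of peak on this GPU,
    # while a sparse MoE model tops out around ~25% (made-up ceilings for illustration).
    current = compute_sol(achieved_flops_per_s=120e12)  # ~12% of peak
    print(f"Compute SOL: {current:.0%}")
    print(f"Headroom vs dense ceiling (0.60): {optimization_headroom(current, 0.60):.0%}")
    print(f"Headroom vs MoE ceiling (0.25):   {optimization_headroom(current, 0.25):.0%}")

The point of the second number is exactly the MoE case above: a low Compute SOL isn't necessarily a tuning failure if the attainable ceiling for that architecture is itself low.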
It's not a zero-sum game: you can both protect people and reap the benefits of health data. Many countries have much safer approaches. UK Biobank typically leads with the scale of its data, but not with its infrastructure.
That's a very important point. The people who opt out first are typically not a random fraction of the population, and this makes it much harder to do any analysis with the resulting datasets: it becomes very hard to know whether your results are representative of the population or not.
This is why it was such a big deal when that researcher at Cleveland State misappropriated UKBB data for a race-science study with Emil Kirkegaard. After he was fired, people on Twitter were all like "this is just suppression of science", but the reality is that what they did, in contravening UKBB rules, posed a potentially existential threat to the whole program.
Good catch! The data is everywhere, re-uploaded every week.
I am aware of ~30 repositories that UK Biobank has asked GitHub to delete and that can still be found elsewhere online. They know the site, they have managed to delete data from it before, and yet the files are still there.
You mean giving anyone access to the data? Or open sourcing the code? If the latter, I think that's generally good practice. Security through obscurity is never good for public infrastructure. In this case, UK Biobank has now switched to a remote access platform (not particularly secure, as the data was found for sale on Alibaba today), contracted out to DNAnexus and Amazon. Private companies have no incentive to open source data unless mandated to do so.
In the EU, there is greater interest in building scalable but also secure platforms for health data. Hopefully good innovation will come from there.
Is it just me, or do they very carefully avoid reporting performance on GPT-5.4 Pro, only the default GPT-5.4? They also very carefully left Anthropic models out of their comparison.
I went back to the BixBench benchmark they mentioned. I couldn't find official results for Anthropic models, but I found a project taking Opus 4.6 from 65.3% to 92.0% (which would be above GPT-Rosalind) with nearly 200 carefully crafted skills [1]. There also appear to be competing models with scores on par with this tuned GPT.
BixBench seems like a really interesting/useful idea, but most of the value for a layperson (like me) is in comparing the results of different models on the benchmark. From what I can find, there is no centralised and regularly updated set of model results. Shame.