Hacker News | olliepro's comments

They do quite a lot of distillation, as we've seen from the American open-weight models from AI2 (the OLMo series). They have a lot of incentive to distill beyond just copying: they're much more compute constrained, so open-model companies distill, but they also do really good architectural work to make their models run faster. There are also technical challenges to distillation when all of the top models have their reasoning traces hidden, so we have to assume these open-weight labs have really great training pipelines as well.

A lot of distillation happens. E.g. the OLMo models have a completely open dataset, and they are heavily distilled. It only makes sense to try to absorb behaviors from the best models out there. That said, I think the open-weight juggernauts are doing genuinely great work with RL, training environments, architectural innovations, etc.
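For anyone unfamiliar with the mechanics: distillation is just training the student to match a teacher's output distribution instead of (or alongside) hard labels. A minimal sketch of the softened cross-entropy loss, in plain Python with illustrative names, assuming you even have teacher logits (closed labs hide these, which is the challenge mentioned above):

```python
import math

def softmax(logits, temperature=1.0):
    # Softmax with temperature scaling; higher T gives softer targets
    # that expose more of the teacher's "dark knowledge".
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between softened teacher and student distributions.
    # Scaling by T^2 keeps gradient magnitudes comparable to the
    # hard-label loss (Hinton-style distillation convention).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    ce = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
    return ce * temperature ** 2

# A student that matches the teacher incurs less loss than one that doesn't.
matched = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
mismatched = distillation_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

When logits are hidden, labs instead distill on sampled outputs (sequence-level distillation): generate text with the teacher and fine-tune the student on it with an ordinary language-modeling loss.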

Thanks for the response. I had too many noodles tonight and forgot to check my writing. I'm a rare generalist, so it is very hard to keep up with this without saying "better autocomplete." My one goal is to not get washed out like my parents did in the great username-and-password wars. I used to have this theory about knowledge in society and silos, and I likened it to condensation on a window: you have all this water so close together and yet not touching; then something happens, a bead runs down the window, and it all connects. I guess distillation reminds me of it, but AI overall reminds me of it, because we all know there are silos and complementary info just waiting to run together and make something happen. I am undoubtedly a naive optimist and believe there are good things coming. It's not a popular opinion, and I think that's mostly because people would rather spend their time guarding their future than defining it. Oh baby, there are more noodles in the fridge, and to think I almost left them at the restaurant.

This would likely only get used for small finetuning jobs. It’s too slow for the scale of pretraining.

It’s too slow for the scale of pretraining.

There isn't really such a thing as 'too slow' as an objective fact though. It depends on how much patience and money for electricity you have. In AI image gen circles I see people complaining if a model takes more than 5s to generate an image, and other people on very limited hardware who happily wait half an hour per image. It's hard to make a judgement call about what 'too slow' means. It's quite subjective.


If it would take so long to train that the model will be obsolete before training finishes, that might be considered too slow. With ML you can definitely hit a point where it is too slow for any practical purpose.

Obsolete because of what? Because with limited hardware you’re never aiming for state of the art, and for fine-tuning, you don’t steer for too long anyway.

Because there is a new model that is better, faster, more refined, etc...

If your training time is measured in years or decades it probably won't be practical.


That’s just playing semantics. Nobody is talking about “objective facts,” nor needs to define them here. If the step time is measured in days and your model takes years to train, then it will never get trained to completion on consumer hardware (the entire point).

So distribute copies of the model in RAM to multiple machines, have each machine update different parts of the model weights, and sync updates over the network
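A toy in-process sketch of that scheme (all names illustrative, not a real framework; a real setup would use something like torch.distributed): every machine keeps a full copy of the model in RAM, applies gradient updates only to its own shard of the weights, and a sync step broadcasts each shard to the others.

```python
class Machine:
    def __init__(self, weights, shard):
        self.weights = list(weights)  # full copy of the model in RAM
        self.shard = shard            # indices this machine is responsible for

    def local_step(self, grads, lr=0.5):
        # Apply gradient updates only to this machine's shard.
        for i in self.shard:
            self.weights[i] -= lr * grads[i]

def sync(machines):
    # Each machine's shard is authoritative; everyone adopts the owner's values.
    for owner in machines:
        for i in owner.shard:
            for m in machines:
                m.weights[i] = owner.weights[i]

init = [1.0, 1.0, 1.0, 1.0]
grads = [1.0, 1.0, 1.0, 1.0]
a = Machine(init, shard=[0, 1])
b = Machine(init, shard=[2, 3])
a.local_step(grads)
b.local_step(grads)
sync([a, b])  # afterwards both machines hold identical updated weights
```

The catch is the sync step: over a home network that broadcast is the bottleneck, which is why real decentralized-training efforts overlap communication with compute and compress the updates.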

decentralized training makes a lot more sense when the required hardware isn't a $40K GPU...

I bet they lack good long context training data and need to start a flywheel of collecting it via their api (from willing customers)


This would be my guess too. It can probably be generated synthetically or via agentic rollouts, but high quality long context examples where outputs meaningfully depend on long-range interactions probably remain scarce


Tensors are in no shortage nowadays. I did read this as tensors though and got a good laugh.


There’s a section of I-15 in Salt Lake County, Utah, which reliably has a crash on weekdays at 6pm. It’s unfortunately at a pinch point in the mountains with no good alternate route… very annoying.

In a similar way that Google Maps shows eco routes, it’d be fun for them to show “safest” routes which avoid areas with common crashes. (Not always possible, but valuable knowledge when it is.)


That feels like it would cause induced demand for crashes though.


Much of the scientific medical literature is behind paywalls. They have tapped into that data source (whereas ChatGPT doesn't have access to it). I suspect that if the medical journals were to make a deal with OpenAI to open up access to their articles and data, OpenEvidence would have to fall back on its existing customers and the stickiness of its product; in that circumstance, they'd be pretty screwed.

For example, only 7% of pharmaceutical research is publicly accessible without paying. See https://pmc.ncbi.nlm.nih.gov/articles/PMC7048123/


Do you think maybe ~10B USD should cover all of them? For both indexing and training? Seems highly valuable.

Edit: seems like it is ~10M USD.


It depends on your thing. If the marathon was just the motivation, your thing is running... if the marathon was the bucketlist item, it is the thing.


Getting everyone to fall in love with the thing is not doing the thing... I learned this as a data scientist brought in to work on a project that ended soon thereafter. A team of 20 people spent 1.5 years getting people to love an idea that never materialized. Time was wasted because the technical limitations and issues came to light too late... it died as a 40-page postmortem that will never see daylight.


I learned that lesson as a solo dev on a project that lasted a year, then learned it again as a team of 4 on a 2-year project. I've not had to learn the lesson again but I've certainly trod the same path... 20 people (including some VERY expensive contractors), 3.5 years, AU$80m to deliver what amounts to a timesheeting system that needs a team of 10 people manually massaging the data every month to make it work.

How do you not be "toxic" after that? How do you retain a chipper attitude when you know for a rock-solid certainty that even if the project is successful it's likely by accident?


Everyone's threshold is different. I aspire to "move fast and break things", but more often than not, I obsess over the rough edges.


