> do have long-term memory from their training and thus act very much like someone suffering from Alzheimer’s.
Your 8th grade science teacher may be disappointed too. Drawing such analogies in unequivocal language ("very much like") disregards our limited understanding of LLMs, the false parallels between computer and biological systems, and the complex nature of Alzheimer's disease (no, it is not just short-term memory loss, not even close; it also affects, for example, the ability to interpret images).
Hmm. The point was that people with Alzheimer's have trouble interpreting images, and obviously remain conscious until the latest stages of their disease.
> remain conscious until the latest stages of their disease.
Are you saying that people with advanced Alzheimer's lose consciousness? That's not the case. Although it might become hard for people with advanced Alzheimer's to demonstrate their consciousness, that doesn't mean their consciousness isn't there.
It's a dumb joke considering Germany has been one of the most peaceful countries for decades. And the people making the jokes are often citizens of a country actively engaged in wars.
Germany's pacifism is, just like its green energy transition, hypocritical and ineffective. Their Energiewende was to shut down nuclear to bring back coal. Their Zeitenwende amounted to bankrolling Putin's war machine via the Nord Stream pipelines at the expense of the very same countries they tried to annihilate in WW2. So yeah, I think I can crack a joke.
What makes ggplot great is that it allows manual adjustments AND has a nice declarative grammar. Hard for me to see the value of a plotting library without being able to adjust plots.
Also a bad analogy. A slice of pizza has no onboarding cost for the user. You eat it and that's it. A PDF editor requires you to understand how to use it.
A better comparison would be a pizza shop at the end of a long hike that advertised itself online as offering an infinite amount of free pizza. So you go on the hike, and then it turns out you only get one slice and have to pay a fortune for the rest. You planned to get free food at the end of the hike, but it turns out the food you end up eating is not free and not even cheap.
It might work, I considered running a test like this. But it does demand certain things.
The subnetwork has to be either crafted as "gradient resistant" or remain frozen. Not all discovered or handcrafted circuits would survive gradient pressure as is. Especially the kind of gradients that fly in early pre-training.
It has to be able to interface with native representations that would form in a real LLM during pre-training, which is not trivial. This should happen early enough in pre-training. Gradients must start routing through our subnetwork. We can trust "rich get richer" dynamics to take over from there, but for that, we need the full network to discover the subnetwork and start using it.
And finally, it has to start being used for what we want it to be used for. It's possible that an "addition primitive" structure would be subsumed for something else, if you put it into the training run early enough, when LLM's native circuitry is nonexistent.
Overall, for an early test, I'd spray 200 frozen copies of the same subnetwork into an LLM across different layers and watch the dynamics as it goes through pre-training. Roll extra synthetic addition problems into the pre-training data to help discovery along. Less of a principled solution and more of an engineering solution.
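The "gradient resistant or frozen" requirement is easy to prototype with a masked update. Here's a minimal sketch (pure numpy, toy sizes; the identity matrix stands in for a handcrafted circuit, and the random gradients stand in for early pre-training noise — all names and shapes are mine, not from any real training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix with a handcrafted subnetwork planted in one block.
W = rng.normal(size=(8, 8))
subnet = np.eye(3)          # stand-in for a handcrafted circuit
W[2:5, 2:5] = subnet
W0 = W.copy()

# Freeze mask: 1 where gradients may flow, 0 over the planted subnetwork.
mask = np.ones_like(W)
mask[2:5, 2:5] = 0.0

def sgd_step(W, grad, lr=0.1):
    """Apply a gradient step everywhere except the frozen block."""
    return W - lr * grad * mask

# Simulate noisy "early pre-training" updates.
for _ in range(100):
    W = sgd_step(W, rng.normal(size=W.shape))

# The surrounding weights drift, but the planted circuit survives intact.
assert np.allclose(W[2:5, 2:5], subnet)
assert not np.allclose(W, W0)
```

Spraying 200 copies is then just 200 such masked blocks across layers; the interesting measurement is how much gradient ends up routed through each copy's inputs over training.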
+1 I’ve always had the feeling that training from randomly initialized weights without seeding some substructure is unnecessarily slowing LLM training.
Similarly I’m always surprised that we don’t start by training a small set of layers, stack them and then continue.
Better-than-random initialization is underexplored, but there are some works in that direction.
One of the main issues is: we don't know how to generate useful computational structure for LLMs - or how to transfer existing structure neatly across architectural variations.
What you describe sounds more like a "progressive growing" approach, which isn't the same, but draws from some similar ideas.
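For the "train small, stack, continue" idea, the standard function-preserving trick from progressive-growing work is to splice in an identity layer, so the deeper network starts out computing exactly what the shallow one did. A toy sketch (numpy, random weights standing in for a trained stage; the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

# A small "already trained" 2-layer MLP (random weights as stand-ins).
W1 = rng.normal(size=(16, 4))
W2 = rng.normal(size=(2, 16))

def shallow(x):
    return W2 @ relu(W1 @ x)

# Grow the net: splice an identity layer in after the ReLU. Since ReLU
# outputs are nonnegative, relu(I @ h) == h, so the function is unchanged
# and training can continue from a deeper, function-preserving start.
W_id = np.eye(16)

def deeper(x):
    h = relu(W1 @ x)
    return W2 @ relu(W_id @ h)

x = rng.normal(size=4)
assert np.allclose(shallow(x), deeper(x))
```

The new layer then gets perturbed and trained along with the rest, rather than starting the whole stack from random init.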
In terms of substructure: in the old days of Core Wars, randomly scattering bits of code that did things could pay off. I'm imagining something similar for LLMs: just set 10% of weights as specific known structures and watch which are retained/utilized by models and which get treated like random init.
I had that in mind too. What if you handcraft a subnetwork with (some subset of) Turing machine capability? Do those kinds of circuits emerge naturally during training? Can reasoning use them for complex computation?
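To be concrete, a handcrafted circuit in this sense can be as small as a fixed linear map. A toy "addition primitive" (pure numpy; the framing of two input slots feeding a read-out is my illustration, not a claim about real LLM internals):

```python
import numpy as np

# A handcrafted addition primitive: a fixed linear read-out that sums
# two input slots. This is the kind of tiny circuit one could plant
# and then check whether pre-training learns to route through it.
W_add = np.array([[1.0, 1.0]])   # y = a + b

def addition_circuit(a, b):
    return float(W_add @ np.array([a, b]))

assert addition_circuit(2.0, 3.0) == 5.0
```

A Turing-machine-style subnetwork would be the same idea scaled up: fixed weights implementing a state-transition table, with the open question being whether the surrounding network ever learns to feed it the right representations.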
Perhaps the real issue is the gate-keeping scientific publishing model. Journals had a place and role, and peer review is a critical aspect of the scientific process, but new times (the internet, citizen science, higher levels of scientific literacy, and now AI) diminish the benefits of journals creating "barriers to entry", as you put it.
I for one hope not to live in a world where academic journals fall out of favor and are replaced by vibe-coded papers by citizen scientists with inflated egos from one too many “you’re absolutely right!” Claude responses.
Me neither, but what you present is a false dichotomy. Science used to be a pastime of the wealthy elites; it became a profession. By opening it up, progress was accelerated. The same will happen when publication is made more open and accessible.