For me, open source means that the entire training data is open sourced, as well as the code used for training; otherwise it's open weight. You can run it where you like, but it's a black box. Nomic's models are a good example of open source.
Even with all training data provided, won't it still be a black box? Unless one trains it exactly the same way, in the exact same order for each piece of data (potentially requiring the exact same hardware, with specific optimizations disabled due to race conditions, etc.), the final weights will be different. So there's no way to verify whether the original weights contain anything extra, which still leaves any released weights as a black box, no? There isn't an equivalent of reproducible builds for LLM weights, even if all of this were provided, right?
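The reproducibility problem largely comes down to floating-point arithmetic not being associative: the same numbers accumulated in a different order can produce different results, so any reordering of work (parallel reductions, batch order) changes the final weights. A toy Python illustration of just that underlying effect (not actual training):

```python
# Toy illustration of why bit-identical retraining is hard:
# float addition is not associative, so summing the same values
# in a different order can give a different result.
vals = [1e16, 1.0, -1e16, 1.0]

# Left to right: 1e16 + 1.0 rounds back to 1e16, so one 1.0 is lost.
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]

# Reordered: the big terms cancel first, so both 1.0s survive.
reordered = (vals[0] + vals[2]) + (vals[1] + vals[3])

assert left_to_right != reordered
print(left_to_right, reordered)  # 1.0 2.0
```

Scale that effect up across billions of parallel, nondeterministically ordered accumulations and the weights diverge even with identical data and code.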
> My personal headcanon: this tooling works well when built on simple patterns, and can handle complex work. This tooling has also been not great at coming up with new patterns, and if left unsupervised will totally make up new patterns that are going to go south very quickly. With that lens, I find myself just rewriting what Claude gives me in a good number of cases.
I've been doing a greenfield project with Claude recently. The initial prototype worked but was very ugly (repeated duplicate boilerplate, a few methods doing the exact same thing, poor isolation between classes)... I was very much tempted to rewrite it on my own. This time, I decided to try to get it to refactor toward the target architecture and fix those code quality issues. It's possible, but it's very much like pulling teeth... I use plan mode, we have multiple rounds of review on a plan (which started from me explaining what I expect), then it implements 95% of it but doesn't realize that some parts were not implemented... It reminds me of my experience mentoring a junior employee, except that Claude Code is more eager (jumping into implementation before understanding the problem), much faster at doing things, and dumber.
That said, I've seen codebases created by humans that were as bad as or worse than what Claude produced when prototyping.
I just run CC in a VM. It gets full control over the VM. The VM doesn't have access to my internal networks. I share the code repos it works on over virtiofs, so it has access to the repos but not to my GitHub keys for pushing and pulling.
This means it can do anything in the VM: install dependencies, etc. So far, it has managed to bork the VM once (unbootable). I could have spent a bit of time figuring out what happened, but I had a script to rebuild the VM, so I didn't bother. To be entirely fair to Claude, the VM runs Arch Linux, which is definitely easier to break than other distros.
That's not helped by a recent change to the "acting_vs_clarifying" section of their system prompt:
> When a request leaves minor details unspecified, the person typically wants Claude to make a reasonable attempt now, not to be interviewed first. Claude only asks upfront when the request is genuinely unanswerable without the missing information (e.g., it references an attachment that isn’t there).
> When a tool is available that could resolve the ambiguity or supply the missing information — searching, looking up the person’s location, checking a calendar, discovering available capabilities — Claude calls the tool to try and solve the ambiguity before asking the person. Acting with tools is preferred over asking the person to do the lookup themselves.
> Once Claude starts on a task, Claude sees it through to a complete answer rather than stopping partway. [...]
In my experience before this change, Claude would stop, give me a few options, and 70% of the time I would give it an unlisted option that was better. It would genuinely identify parts of the spec that were ambiguous and needed to be better defined. With the new change, Claude plows ahead, makes a stupid decision, and the result is much worse.
When I was an exchange student at RIT, having arrived from France just a month before, one of the admin staff invited me and a friend in the same situation over for Thanksgiving because she didn't want to leave us by ourselves on a major holiday.
I have fond memories of that kindness.
I tend to find that for things like this that are really math heavy, it's usually better to create a DSL (or easily readable function calls, etc.) that you can easily write yourself, instead of relying on AI to understand math-heavy rules.
Bonus points: if the rules are in an easily editable format, you can change them easily when they need to change. It seems that was the path the author took...
And yes, this kind of use case is exactly where unit tests shine...
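A minimal sketch of that approach in Python (every rule and name here is hypothetical, just to show the shape): the rules live as plain, easily editable data, a small hand-written engine interprets them, and unit-test-style assertions pin the math down.

```python
# Hypothetical rule set kept as plain data: easy to read, review, and edit.
RULES = [
    {"name": "bulk_discount", "min_qty": 10, "rate": -0.10},  # 10% off for 10+ items
    {"name": "rush_fee", "rush": True, "rate": 0.25},          # 25% surcharge for rush orders
]

def applicable(rule, order):
    """Check whether a rule's conditions match an order."""
    if "min_qty" in rule and order["qty"] < rule["min_qty"]:
        return False
    if "rush" in rule and order.get("rush", False) != rule["rush"]:
        return False
    return True

def price(order, rules=RULES):
    """Base price with every applicable rule's rate applied."""
    total = order["qty"] * order["unit_price"]
    for rule in rules:
        if applicable(rule, order):
            total *= 1 + rule["rate"]
    return round(total, 2)

# The unit tests are where this shines: each rule gets a pinned-down example.
assert price({"qty": 10, "unit_price": 5.0}) == 45.0               # bulk discount
assert price({"qty": 1, "unit_price": 5.0, "rush": True}) == 6.25  # rush fee
assert price({"qty": 1, "unit_price": 5.0}) == 5.0                 # no rules apply
```

Since the engine stays small and hand-written, changing a rule is a data edit plus a test update, with no AI in the loop for the math itself.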
> create a DSL (or create easily readable function calls, etc)
These aren't really that different. Consider the history of the earliest (non-assembly) programming languages, particularly https://en.wikipedia.org/wiki/Speedcoding , as well as the ideas expressed by Lisp.
Oh yeah, that's why I added the parenthetical. I consider Lisp macros to be a DSL, and that's exactly what I tend to like using. Similarly with Ruby and some metaprogramming tricks.
I do the opposite, set up everything myself in terms of architecture/design of the software, so the AI can do the boring boilerplate like "math heavy rules". Always interesting to see how differently we all use LLMs.
I've usually not been impressed by AI's implementations of math-heavy rules, so I wouldn't trust it much; I tend to find it easier to write them myself and then verify :) Yup, it's always interesting to see the different usages.
That's what the French government paid per year per student at my engineering school in the early 2000s. Tuition fees paid by the student were 540 euros a year, but the cost to the government was quite high.
France is the same; the better universities are all public. But I know the government spent an average of 35,000 euros per student at top public engineering schools in the early 2000s (not sure about nowadays), so they do have funds. It's just that the way money is allocated depends on actually being great academically.