To address your actual pedantry, clearly you have some implicit normative belief about how a book about category theory should be written. That's cool, but this book has clearly chosen another approach, and appears to be clear and well explained enough to give a light introduction to category theory.
The syntax in the article is not Scheme; you can see that clearly in my comment you're responding to.
As for your 'light introduction' comment: even ignoring the code, these are not pedantic complaints but basic mathematical and factual errors.
For example, the statement of Birkhoff’s Representation Theorem is wrong. The article says:
> Each distributive lattice is isomorphic to an inclusion order of its join-irreducible elements.
That is simply not the theorem. The theorem says: "Any finite distributive lattice L is isomorphic to the lattice of lower sets of the partial order of the join-irreducible elements of L." You can read the statement on Wikipedia [0].
The article is plain wrong. The join-irreducibles themselves form a poset. The theorem is about the lattice of down-sets of that poset, ordered by inclusion. So the article is NOT simplifying but misstating one of the central results it tries to explain. Call it a 'light introduction' all you want. This does not excuse the article from reversing the meaning of the theorem.
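To make the difference concrete, here is a standard small example (not from the article, just the usual textbook illustration):

```
% Take the four-element Boolean lattice, ordered by inclusion:
B_2 = \{\emptyset,\ \{x\},\ \{y\},\ \{x,y\}\}
% Its join-irreducible elements form a two-element antichain, which is NOT isomorphic to B_2:
J(B_2) = \{\{x\},\ \{y\}\}
% But the lattice of down-sets (lower sets) of J(B_2), ordered by inclusion, IS isomorphic to B_2:
\mathcal{O}(J(B_2)) = \{\emptyset,\ \{\{x\}\},\ \{\{y\}\},\ \{\{x\},\{y\}\}\} \cong B_2
```

The article's phrasing hands you J(B_2) where the theorem hands you the down-sets of J(B_2); those are different objects.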
It's basically like saying 'E=m*c' is a simplification of 'E=m*c^2'.
> This does not excuse the article from reversing the meaning of the theorem.
What's with this hyperbole? Even the best math books have loads of errors (typographical, factual, missing conditions, insufficient reasoning, incorrect reasoning, ...). Just look at any errata list published by any university for their set books! Nobody engages in this kind of hyperbole over errors in math books. Only on HN do you see this kind of takedown, which is frankly very annoying. In universities, professors and students just publish errata and focus on understanding the material, not tearing it down with such a dismissive tone. It's totally unnecessary.
I don't know if you've got an axe to grind here or if you're generally this dismissive, but calling it "simply not the theorem" or "plain wrong" is a very annoying kind of exaggeration that misses all nuance and human fallibility.
Yes, the precise statement of Birkhoff's representation theorem involves down-sets of the poset of join-irreducibles. Yes, the article omits that. I agree that it is imprecise.
But it's not "reversing the meaning". It still correctly points to reconstructing the lattice via an inclusion order built from join-irreducibles. What's missing is a condition. It is sloppy wording, but not the fundamental error you so badly want us to believe it is.
Feels like the productive move here is just to suggest the missing wording to the author. I'm sure they'll appreciate it. I don't really get the impulse to frame it as a takedown and be so dismissive when it's a small fix.
At the lowest level of abstraction, LLMs are just matrix multiplications: deterministic functions of their inputs. Of course, we can argue about the details and specifics of how the peculiarities of inference in practice lead to non-deterministic behaviours, but now our model is being complicated by vague aspects of reality.
One convenient way of sidestepping these is to model them as random functions, sure. I wouldn't go as far as to say they are "inherently stochastic creatures". Maybe that's the case, but you haven't really given substantial evidence to justify that claim.
At a higher level of abstraction, one possible model of LLMs is again as deterministic functions of their inputs, but now as functions of token streams or higher abstractions like sentences rather than the underlying matrix multiplication. In this case we again expect LLMs to produce roughly consistent outputs given the same prompt, and again we can apply deterministic theorems.
I guess my central claim is that there hasn't been a salient argument made as to why the randomness here is relevant for consensus. Maybe the models exhibit some variability in their output, but in practice does this substantially change how they approach consensus? Can we model this as an artefact of how they are initialised rather than some inherent stochasticity? Why not? It feels like randomness is being introduced here as a sort of magic "get out of jail free" card.
LLMs utilize categorical distributions defined by the logits computed by the matrix multiplies, and there are many sampling strategies which are employed. This is one of the core mechanisms for token generation.
There's no peculiarity to discuss; that's how they work. That's how they are trained (the loss is defined by likelihood computations over those distributions), that's how inference works, etc.
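To make that concrete, here's a minimal sketch of the step being argued about (plain NumPy, illustrative only, not any particular inference stack): the logits are a deterministic function of the input, and the stochasticity, if any, only enters in how the next token is drawn from the distribution they define.

```python
import numpy as np

def next_token(logits, temperature=1.0, rng=None):
    """Draw the next token id from a vector of logits.

    temperature == 0 collapses to argmax: a fully deterministic function of the logits.
    temperature > 0 samples from the categorical distribution the logits define.
    """
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))            # deterministic decoding
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))  # the only stochastic step
```

With a fixed seed the whole thing is of course a deterministic (pseudo-random) function again, which is roughly what is being argued over upthread.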
> I guess my central claim is that there hasn't been a salient argument made as to why the randomness here is relevant for consensus. Maybe the models exhibit some variability in their output, but in practice does this substantially change how they approach consensus? Can we model this as an artefact of how they are initialised rather than some inherent stochasticity? Why not? It feels like randomness is being introduced here as a sort of magic "get out of jail free" card.
I'm really surprised to hear this given the content of the post. The claims in the post are quite strong, yet here I need to give a counterargument to why the claim about consensus applying to pseudorandom processes is relevant?
I don't think it's necessary to furnish a counterexample when pointing out that a formal claim is overreaching. It's not clear what the results are in this case! So it feels premature to claim that the results cover a wider array of things than shown.
For instance, this is a strong claim:
> it means that in any multi-agentic system, irrespective of how smart the agents are, they will never be able to guarantee that they are able to do both at the same time:
>
> Be Safe - i.e produce well formed software satisfying the user's specification.
> Be Live - i.e always reach consensus on the final software module.
I'm confused as to the stance: we're either hand-waving or we're not -- so which is it?
I just came away from the read thinking that this post was pointing to something very strong, and was a bit irked to find that the state of the results is more subtle than the post conveys.
If you're pushing me, let's say we're not hand-waving then. LLMs, abstraction removed, are deterministic matrix-multiplication computations, f(x) -> y. If you want, we can make them pseudo-random, but that is still a deterministic process. FLP then holds. I'm not sure what your confusion is.
You’re right, there is that one example. Feels like we’re in “exception that proves the rule” territory. But I’d be very interested in being proven wrong! This isn’t a desire of mine, just what I’ve seen. Do you have other examples?
Also, part of my confidence comes from both having been a professional programmer for decades, across many languages, and also having programmed in Lean. It’s a great language for math, perhaps the best choice right now. But as a general purpose language it’s incredibly quirky.
Right, but what you're describing is a consensus protocol. It's called two-phase commit. The point of the article is just that we should really be analysing these high-level plans in distributed-algorithms terms, because there are fundamental limitations that you can't overcome.
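For anyone who hasn't seen it, two-phase commit is roughly this (an illustrative toy sketch; the Participant class and its prepare/finish methods are made-up names, not any particular library's API):

```python
class Participant:
    """Toy participant: votes yes unless told otherwise, then records the outcome."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.outcome = None

    def prepare(self, txn):
        return self.can_commit           # Phase 1: vote on whether we can commit

    def finish(self, txn, decision):
        self.outcome = decision          # Phase 2: apply the coordinator's decision


def two_phase_commit(participants, txn):
    # Phase 1 (voting): every participant must promise it can commit.
    votes = [p.prepare(txn) for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (completion): broadcast the single decision to everyone.
    for p in participants:
        p.finish(txn, decision)
    return decision


print(two_phase_commit([Participant(), Participant(can_commit=False)], txn="tx-42"))  # -> abort
```

The well-known catch, and part of why the fundamental-limitation framing matters, is that if the coordinator dies between the two phases, participants that already voted yes are stuck waiting: the protocol is blocking and doesn't dodge the impossibility results.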
You can still verify arbitrarily long-running programs - there are instances of such software, such as seL4 (https://sel4.systems/) and CertiKOS (https://flint.cs.yale.edu/certikos/); you simply model them as finite programs that run on an infinite stream of events.
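The shape of that model is essentially an event loop: a finite step function applied over and over to an unbounded stream of inputs (purely illustrative names here, not how seL4 or CertiKOS are actually structured):

```python
from itertools import count, islice

def handle(state, event):
    # The finite, per-step transition. This is what per-step verification talks about:
    # e.g. "handle preserves the system invariant for every possible event".
    return state + event

def run(events):
    state = 0                      # initial state
    for ev in events:              # the stream may be unbounded...
        state = handle(state, ev)  # ...but each step is a finite computation
    return state

print(run(islice(count(), 1000)))  # demo: the same loop works for any prefix of an infinite stream
```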
> finite programs that run on an infinite stream of events
This requires coinduction, right? (That's my understanding of the formal representation of infinite streams.) If so, that does limit your options, since most of the proof assistants don't handle coinductive data, as I understand it.
Repeating myself, when we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug is in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified to be free of bugs.
Second, there was a bug in the code. Maybe not a functional-correctness bug, but I, along with many if not most end users, would consider a crashing program buggy. Maybe we just have different tastes, or different standards for what we consider an acceptable level of software quality.
W.r.t people running Lean in production, you'd be surprised...
We're not speaking about bugs in a verified system so much as writing articles making specific claims about that. Surely if we're at the level of precision of formal verification, it's incumbent upon us to be precise about the nature of a problem with it, no? "Lean proved this program correct and then I found a bug" heavily implies a flaw in the proof, not a flaw in the runtime (which to my mind would also be a compelling statement, for the reasons you describe).
Or it implies a bug in the specification. The spec differing from the intent is a frequent source of bugs; it doesn't matter what language the spec is written in. Most people have experience with (or have seen news stories about) specification bugs in natural-language specifications: legal loopholes!
This is the biggest risk with the rejuvenated interest in formal proof. That LLMs can generate proofs is useful. Proof assistants that can check them (Lean/FStar/Isabelle/...) similarly so.
But it just moves the question to whether the theorems covered in the proof are sufficient. Underlying it all is a simple question:
Does the system meet its intended purpose?
To which the next question is:
What is the intended purpose?
Describing that is the holy grail of requirements specification. Natural language, behaviour-driven development, test-driven development and a host of other approaches attempt to bridge the gap between implicit purpose and explicit specification. Proof assistants are another tool in that box.
It's also one of the key motivators for iterative development: putting software in front of users (or their proxies) is still the primary means of validation for a large class of systems.
None of which is implied criticism of any of those approaches. Equally, none completely solves the problem. There is a risk that formal proofs, combined with proof assistants, are trumpeted as "the way" to mitigate the risk that LLM-developed apps don't perform as intended.
They might help. They can show that code is correct with respect to some specification, and that the specification is self-consistent. They cannot prove that the specification is complete with regard to its intended purpose.
Sorry, I'm not sure I follow. We are talking about bugs in a verified system, that is, in this case, a verified implementation of a zlib-based compression tool. Did it have bugs? Yes. Several, in fact. I'd recommend reading the article for a detailed listing of the bugs in the tool.
> The most substantial finding was a heap buffer overflow! but, not in lean-zip's code, but in the Lean runtime itself. (emphasis mine)
> The OOM denial-of-service is straightforward: the archive parser was never verified. (me again)
"Lean proved this program correct" vs "I found bugs in the parts that Lean didn't prove was correct". The failure was not in Lean's proof (which as I said is heavily implied by the title), but in ancillary or unverified code.
Do I misunderstand how Lean works? I am by no means an expert (or even an amateur) on Lean or formal systems in general. Surely the first class of bug could be found in any code that uses the framework, and the second class of bug could happen to any system that isn't proven? Does that make sense? Otherwise where's the boundary? If you find a bug in the OS, does that mean Lean failed somehow? How about the hardware? If your definition of a 'formally verified system' goes beyond the code being verified and the most important thing is whether or not you can make it crash, then the OS and the hardware are also part of the system.
Of course bugs are important to users regardless of the cause, but the audience for your article seems to be software engineers. As a software engineer, I found your article interesting and what you found useful, but the title was misleading; that's all I'm saying.
Further to my earlier reply - a more succinct way to look at it might be:
- When they fix the runtime, bug A goes away. So the proof still holds and the zlib code is still correct.
- When they add a system of proofs for the parser and modify that, then bug B goes away. So the proof still holds and the zlib code is still correct; and now more of the library is proven correct.
The formulation of the title is "I was told X but that's not true"... but that's not true. You were told X, and X is true, but you found Y and Z which are also important.
There are two different answers to this question, and which one is "correct" depends entirely on the context of who is asking it.
1. It's the code that is specific to this program that sits above the run-time layer (internal view, that most programmers would take).
2. It's the code in the binary that is executed (external view, that most users would take).
The key question does not seem to be "was the proof correct", rather "did the proof cover everything in the program". The answer depends on whether you are looking at it from the perspective of a programmer, or a user. Given the overly strong framing that the article is responding to - highlighting the difference in this way does seem to be useful. The title is correct from the perspective that most users would take.
Yes but, without wishing to be snarky, did you read the article? There is no program as such, in either sense - the announcement from Lean only mentions "a C compression library" (zlib). Not only that, but since we're talking about formal verification, a programmer would likely understand that that is about proving a bounded, specific codebase at source code level, and not operating on a binary along with its associated dependencies (again caveat my limited understanding of these things).
My feeling is that if you told the average non-technical user that some person/organisation had produced a formally verified version of a C compression library, you would likely get a blank look, so I think it's reasonable to assume that both Lean's intended audience, and the audience of the blog post linked here, correspond with number 1. in your list.
The article describes fuzzing the library; that execution requires a program to be compiled. Typically fuzzing involves a minimal harness around the payload (a single call into the library in this case). There is clearly a bug in this program, and it does not exist in the minimal harness. It must be in the library code, which was covered by the proof.
The bounded, specific codebase that you refer to is typically the library *and all of its dependencies*, which in this case includes the Lean runtime. This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations. In this case the original gushing claim that everything was verified is incorrect and premature. The article seems like a good exposition of why.
Thank you, I understand what fuzzing is; that test harness was presumably provided either by the blog post author or generated by Claude somehow, and therefore would not have been part of the proven code, nor part of the original claim by the Lean devs. That's what I meant by saying there is no program as such.
> The bounded, specific codebase that you refer to is typically the library and all of its dependencies, which in this case includes the Lean runtime.
How does that work? I thought the main idea is to write code in the Lean language which has some specific shape conducive to mathematical analysis, along with mathematical proofs that operate on that code. How then does a system like this handle a third party dependency? I've searched around and I can't find any information about how it works. I assumed that the boundary of the proof was the source code - surely they can't also be proving things like, say, DirectX?
> This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations.
The difficulty is not in explosions of computational complexity due to problems of incompleteness, decidability, the halting problem, those kinds of things? As I said this is not something I know much about but it's surprising to me if 'analysing all the code' is really the difficult bit.
There are standard convex assumptions to handle incompleteness, decidability, etc., i.e. the results are an over-approximation that terminates. Picking an approximation that is precise enough in the properties that you care about is part of the challenge, but it is an in-band problem. There are no hard edges between the theory and reality.
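As a toy illustration of what "an over-approximation that terminates" means (a made-up interval domain, not any particular analyser):

```python
# Interval abstract domain: every set of possible integer values is summarised
# as a single (lo, hi) pair. The analysis always terminates and is sound
# (it never misses a reachable value), at the price of precision
# (it may include values that can never actually occur).

def add(a, b):
    """Abstract addition: the interval covering every possible concrete sum."""
    return (a[0] + b[0], a[1] + b[1])

x = (0, 10)       # "x is somewhere in [0, 10]" -- already an over-approximation of, say, {0, 2, 4}
y = (1, 1)        # y is exactly 1
print(add(x, y))  # (1, 11): every real outcome is covered, plus some impossible ones
```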
As with most engineering problems the out-of-band issues tend to be the hardest to solve. Models of the part underneath the interesting part need to be complete/accurate enough to make the results useful. Compare it to crypto where people do not usually try to break the scheme - they try to break the specific implementation of the scheme because the weakest points will be at the interface between the theoretical construction and the actual concrete instantiation of the device that it will run on.
I am pretty sure you could tell a teenager "there's a ZIP compression program that's scientifically proven to have no bugs" and they'd understand you. People don't have to be CS experts to understand that. (Technically it's Gzip but that's mostly irrelevant to understanding the claim here)
Gentle reminder about this excerpt from HN Guidelines:
> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".
Say you study some piece of software. And it happens that it has an automated suite of tests. And it happens that some files aren't covered by the test suite. And you happen to find a bug in one of those files that were not covered.
Would you publish a blog post titled "the XXX test suite proved there was no bug. And then I found one"?
These were my thoughts as well, and I think it's nothing new as far as testing in general goes:
- testing libraries (and in this case, the language itself) can have bugs
- what is not covered by tests can have bugs
Additionally, I would add that tests verify the assumptions of the coder, not the expectations of the business.
To give the author the benefit of the doubt, I'd read the article as: having tests for a given thing ensures that it does the thing you built the tests for. This doesn't mean that your application is free of bugs (unless you have 100% coverage, can control the entire state of the system, etc.), nor that it does the right thing (or that it does the thing the right way).
I like to differentiate between coverage by lines and semantic coverage: sometimes you need to exercise a single line multiple times to get full semantic coverage, and better semantic coverage usually beats larger line coverage for detecting problems and preventing regressions.
Yes, mutation testing and similar techniques like fuzzing can help with it, but sometimes you want to be more deterministic: there are usually a lot of hidden side effects that are not obvious, and they are probably the source of the majority of software bugs today.
Eg. something as simple as
```
async function foo(url) {
  const data = await fetch(url);  // can reject: network failure, DNS error, invalid URL, abort, ...
  return data;
}
```
has a bunch of exceptions that can happen in fetch() that are not covered by your tests, yet you can get 100% line coverage with a single deterministic happy-path test.
Basically, any non-functional side-effect behavior is a source of semantic "hiding" when measuring test coverage by lines being executed, and there is usually a lot of it (logging is another common source of side-effects not tested well). Some languages can handle this better with their typing approaches, but ultimately, there will be things that can behave differently depending on external circumstances (even bit flips, OOMs, full disk...) that were not planned for and do not flow from the code itself.
You're technically right, but what things are versus how they're promoted or understood by most people (rightfully or not) often diverges, and therefore such "grounding" articles are useful, even if the wording addresses the perceived rather than the factual reality.
By way of analogy, if there were an article saying "I bought a 1TB drive and it only came with 0.91 terabytes", and you started explaining that technically the problem is the confusion between SI and binary units (1 TB is roughly 0.91 TiB) and the title should really be "I bought a 0.91-terabyte drive and was disappointed it didn't have more capacity", you'd be right, but people would rightfully eyeroll at you.
To be clear, I think the article was fine and the author did some useful work (finding a bug in the runtime of a supposedly provably correct system is indeed a valuable contribution!). I don't agree that it's pedantic to explain why the title feels like a bait-and-switch (and thus does a disservice to the article itself). It's just a bit of feedback for future reference.
I take some comfort from being technically correct though; it's widely accepted by all of us pedants that that is the best kind of correct ;-)
> Repeating myself, when we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
Yes, and that would be relevant if this was a verified software system. But it wasn't: the system consisted of a verified X and unverified Y, and there were issues in the unverified Y.
The article explicitly acknowledges this: "The two bugs that were found both sat outside the boundary of what the proofs cover."
1/ lean-zip is open source, so it's much easier to have more Claude eyes looking at it
2/ I don't think Claude could prove anything substantial about the zip algorithm. That's what Lean is for. On the other hand, Lean could not prove much about what's around the zip algorithm, but Claude can be useful there.
> I don't think the fact that the bug is in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified to be free of bugs.
Reminds me of what some people in the Rust community do: they argue over how safe this or that is. I always counter that the code is composed of layers, of which unsafe is going to be one. So yes, you are right to say that. Unsafe means unsafe, safe means safe, and we should respect the meaning of those words instead of twisting them for marketing (though I must say I heard this from people in the community, never from the authors themselves).
Totally agreed. For instance, that's why seL4 just throws the whole binary at the proof engine. That takes any runtime (minimal in seL4's case) and the compiler (not nearly as minimal) out of the TCB.
Hi Kiran, thanks for following up. FWIW, I enjoy your blog and your work. And I do think it was a valuable bug you found; also nice to see how quickly Henrik fixed it.
Say more about people running Lean in production. I haven’t run into any. I know of examples of people using Lean to help verify other code (Cedar and Aeneas being the most prominent examples), but not the actual runtime being employed.
I took a quick scan of lean-lang.org just now, and, other than the two examples I mentioned, didn’t see a single reference to anything other than proving math.
I’m sure you’re in the Lean Zulip, based on what you’ve been up to. Are you seeing people talk about anything other than math? I’m not, but maybe I’m missing it.
Yes, here's a concrete example: https://github.com/leanprover/SampCert This is an implementation of a verified sampler, in Lean. Not an embedding in some other language. The implementation itself is in Lean, and a Python FFI is used to call into the verified implementation. I don't know if AWS is big enough for your standards, but here is at least one example. Besides that, I'm more reporting on the general vibe I have observed from numerous talks at AI4Math workshops at NeurIPS, at the DARPA AI4Math ExpMath kickoff, etc. People are considering Lean as a serious programming language. Maybe that's surprising to the mathematicians, but as a PL person, I find the language really nicely designed and I can understand why people want to write in it.
You’re right, I should have been more careful in my reference to AWS. No need to be snarky about it.
Let me rephrase: aside from that one example from a couple of years ago, I haven’t seen any examples of production code written in Lean. I’d be very interested in being proven wrong; this isn’t something I desire, just what I’ve observed. Have you seen any others?
More generally, you implicitly make a good point: writing important libraries in Lean and calling into them from another language is probably the most likely use case. So not programs / apps / binaries written in Lean, but small critical components.
> if the software you were running was advertised as formally verified to be free of bugs.
Nobody should be advertising that. Even ignoring the possibility of bugs in the runtime, there could also be bugs in the verifier and bugs or omissions in the specification. Formally verified never means guaranteed to be free of bugs.
Hi! Author here. When we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug is in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified to be free of bugs.
Secondly, I did find a bug in the algorithm: in Archive.lean, in the parsing of the compressed archive headers. That was the crashing input.
> I think it's fair to consider the entire binary a fair target.
Yes, it's still very much a bug. But it has nothing to do with your program being formally verified or not. Formal verification can do nothing about any unverified code you rely on. You would really need a formal verification of every piece of hardware, the operating system, the runtime, and your application code. Short of that, nobody should expect formal verification to ensure there are no bugs.
I read it as: that’s also the point. Adding formal verification is not a strict defense against bugs. It is, in a way, similar to having 100% test coverage and still finding bugs in your untested edge cases.
I don’t think the author is attempting to decry formal verification, but I think the article carries a good message everyone should keep in mind: safety is a larger, whole-system process, and bugs live in the cracks and interfaces.
You're right. It just seems as though it should be self-evident. Especially to those sophisticated enough to understand and employ formal verification.
It does seem that way, doesn't it? But as software bugs are becoming easier to find and exploit, I'm expecting more and more people, including those not "sophisticated enough" to understand and employ formal verification, to start using it.
Then it would help to not introduce any confusion into the ecosystem by using a click-baity title that implies you found a bug which violated the formal specification.
> When we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
Yeah, I would actually agree. We wouldn't want to advertise that a system is formally verified in some way if that creates a false sense of security. I was just pointing out that, by my reading, the title appears to suggest that the core mechanism of the Lean proof is somehow flawed. When I read the title, I immediately thought, "Oooh. Looks like someone demonstrated a flaw in the proof. Neat." But that's not what is shown in the article. Just feels a bit misleading is all.
OK, so now I see the shadow edit you did for the code source, thanks. Unfortunately, it shows that you are incorrect. For one, the function is a private function and can only be called by local code. Everywhere the function is called, the size given to it is verified by the program; there is even a note that says it limits the maximum zip file size to avoid a zip bomb. In addition, the code you are quoting isn't even the final code; it is an interim step from what Claude was iterating on. Sucks that this got so much traction, as you are purposely being deceptive in trying to say that this is a bug. You intentionally removed the 'private' keyword from the function signature, as you knew it would have tipped off most people to check where it is actually used.
Sorry to hijack the thread. Really cool post. How long did the whole exercise, including porting zlib to Lean, take?
I have a hard real-time system that I would love to try this on, but that's a lot of tools to learn, and it's unclear how to model distributed systems in Lean.
Also, please add RSS so I can subscribe to your blog.
Lean-zip was not my project but one by others in the Lean community. I'm not sure about the methodological details of their process - you might want to check with the original lean-zip authors (https://github.com/kim-em/lean-zip).
I think chalking the success of and excitement around Rust up to ideology or a "cult", as you say, is somewhat burying your head in the sand. There are concrete facts and results. Rust is empirically producing levels of memory safety that humanity did not think were possible for software at scale. This is truly groundbreaking.
So to clarify, the crusade to rewrite things in Rust is not my fight. To be honest, I'm more a third party watching from the sidelines. There seems to be big institutional interest in replacing C with Rust. Not for ideological reasons, just for plain economic ones: Rust code is beating industry baselines for memory safety, bug density, and review time. I link it in the article, but for example, DARPA has a recent big grant program, TRACTOR: Translating All C to Rust (https://www.darpa.mil/research/programs/translating-all-c-to...).
My article was more a commentary aimed at the efforts towards doing that. In some sense, it argues that there are currently some foundational formal deficiencies that mean it's not even clear what success would look like. I guess I don't really take a position on the value of rewriting things in Rust, aside from adopting the views of these existing programs as a starting point.
If it is your article, that makes the title clickbait.
As others have said, it is ridiculous to suggest that the code which controls Christmas tree lights must be written in Rust. In your example, it would be easier to follow the machine code generated (which only has "if" and "goto") from C than from Rust.
W.r.t. the first point: ideally you wouldn't want to do that, because it would incur a heavy runtime performance cost. Rust's memory analysis allows eliminating those kinds of memory bugs without having to check writes at runtime.
W.r.t. the second point: I talk a bit about that in the article itself -- the fundamental problem right now is that there's no real formal way of even stating what it means to correctly translate a program from C to Rust. We could maybe have a smart LLM that translates things to Rust, but would you trust it without tests? Or ideally a proof of correctness? What properties should we test? Etc.
Ah! You're talking about Racket or Scheme!
```
> (sort '(3 1 2) (lambda (a b) (< a b)))
'(1 2 3)
```
I suppose you ought to go and tell the R6RS standardisation team that an HN user vehemently disagrees with their API: https://www.r6rs.org/document/lib-html-5.96/r6rs-lib-Z-H-5.h