To clarify: we do content-based hashing, and when we say "shared bytes aren’t guaranteed to be in the exact same container image layer", what we mean is that
  FROM some/image
  RUN pip install torch==2.7.1
and
  FROM another/image
  RUN pip install torch==2.7.1
will produce images with very high overlap in contents, which will be shared by a content-based cache, but those images' final layers are disjoint from the perspective of a layerwise cache.
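To make the distinction concrete, here is a toy sketch in Python. It is not how any real registry or Modal's cache is implemented: the file names and contents are made up, and the layerwise key is modeled simply as a hash of the parent image plus the command, which is an assumption for illustration.

```python
import hashlib


def layer_key(base_image: str, command: str) -> str:
    """Toy layerwise cache key: depends on the parent image and the command."""
    return hashlib.sha256(f"{base_image}\n{command}".encode()).hexdigest()


def content_keys(files: dict[str, bytes]) -> set[str]:
    """Toy content-based keys: one hash per file's bytes, wherever the file lives."""
    return {hashlib.sha256(data).hexdigest() for data in files.values()}


# Hypothetical files that the same pip install produces on either base image.
torch_files = {
    "site-packages/torch/__init__.py": b"# torch package",
    "site-packages/torch/lib/libtorch.so": b"\x7fELF...torch binary...",
}
image_a = {**torch_files, "etc/os-release": b"some/image"}
image_b = {**torch_files, "etc/os-release": b"another/image"}

cmd = "RUN pip install torch==2.7.1"

# Same command, different parent: the layerwise keys differ, so nothing is shared.
layers_shared = layer_key("some/image", cmd) == layer_key("another/image", cmd)

# Same bytes, different images: the content hashes overlap, so the files are shared.
shared_content = content_keys(image_a) & content_keys(image_b)
```

In this toy model the two final layers never match, while the two torch files hash identically and would be stored once.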
Yep! That should start in ten seconds or so -- about a second per gigabyte of weights, plus a second to start the container and a few seconds to load the memory snapshot.
How can you cut latency by more than 1x? I'm not trying to be snarky, it just doesn't fit in my brain how you can reduce a measured time by more than the original starting time.
> Ed Zitron also called out the business model of GPU-as-a-service middleman companies like Modal as deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds.
You got a link for that? I work on Modal and would be interested in seeing the argument!
We think building a proper software layer for multitenant demand aggregation on top of the public clouds is sufficient value-add to be a sustainable business (cf DBRX and Snowflake).
Snowflake and Databricks provide data storage and pipeline features and therefore have extraordinary lock-in potential, which allows them to have sustainable business models.
GPU compute is essentially fungible. It's quite a stretch to compare those business models. Snowflake and Databricks don't necessarily have the best "value-add" and they don't need to.
Sorry to lead with a bunch of jargon! Wanted to make it obvious that we'd give concrete recommendations instead of palaver.
The technical terms there are later explained and diagrammed, and the recommendations derived from something close to first principles (e.g. roofline analysis).
Thanks! I think computers are fun and I want reading about them to be fun too.
I was also reminded of HazyResearch's MegaKernels. Didn't want to distract from the main thrust of the post, but definitely think that's a promising approach.
Reductively, software engineering means taking an idea and mapping it into code. So one form of "reverse" engineering would be taking the code and extracting the ideas. That's what we did here.
Because the source is public, there's quite a lot to work with from the start -- the warp specializations are named and there are helpful comments in many places.
But for many components, we didn't have much. Maybe the clearest case of "reverse engineering" explained in the post is with the cubic approximation for the rational part of the exponentiation. That required staring at some inline assembly and doing math.
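For readers who haven't seen the trick, here is a minimal Python sketch of the general idea: split the exponent into an integer part and a small remainder, approximate 2^remainder with a cubic polynomial, and apply the integer part as an exact power-of-two scale. This assumes "rational part" above refers to the non-integer remainder of the exponent, and it uses plain Taylor coefficients, not the coefficients from the kernel's inline assembly.

```python
import math

LN2 = math.log(2.0)


def exp2_cubic(x: float) -> float:
    """Approximate 2**x using a cubic polynomial on the non-integer remainder."""
    n = math.floor(x + 0.5)  # nearest integer to x
    f = x - n                # remainder in [-0.5, 0.5)
    t = f * LN2              # 2**f == e**t
    # Cubic Taylor polynomial for e**t in Horner form: 1 + t + t^2/2 + t^3/6.
    poly = 1.0 + t * (1.0 + t * (0.5 + t / 6.0))
    # Scaling by 2**n is exact: it only adjusts the floating-point exponent.
    return math.ldexp(poly, n)


# Worst relative error over a grid of exponents in [-10, 10].
worst = max(
    abs(exp2_cubic(x) - 2.0 ** x) / 2.0 ** x
    for x in (i / 100 for i in range(-1000, 1001))
)
```

A tuned minimax polynomial would do noticeably better than Taylor coefficients at the same degree; the point here is only the range reduction plus a cheap polynomial.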
I've never heard of this definition of reverse engineering -- when one has the unobfuscated actual source code I'd usually call it: reading the code, or something like summarization.
Not trying to be uncharitable, I found your article informative. Reverse engineering has historically been reserved for cases where there is an adversarial aspect, as in binaries or server APIs. Anyhow, cheers and thank you, sincerely.
That is the traditional explanation of why it is called reverse engineering. The term originated in hardware engineering. When it was originally applied to software, it was common to create requirements documents and design documents before coding, even if the actual process did not strictly follow the "waterfall" idea.
Thus it was natural to call the process of producing design documents from undocumented software "reverse engineering". These days coding without any formal design documents is so common that it seems the original meaning of reverse engineering has become obscured.
In what time period and field did you come across this usage? As far as I ever saw it used, 'reverse engineering' generally referred to creating docs from executables or from watching network protocols rather than from source.
Back in the 1990s. As an example, back then the Rational Rose design software had a feature to generate UML diagrams from existing source code, and it was called "reverse engineering".
Having the source code and understanding how it works are two different things, especially when running on state-of-the-art hardware. If I had just read the source I would not have gained as much knowledge as this article taught me. Where did this extra info come from? They read the source too, but then they did something more. I wouldn't call it summarization either, as again any summary I wrote about the code would pale in comparison.
You guys are being obtuse. Engineering is turning a spec into a more technical artifact, whether that's source code, machine code, physical hardware or something else. Reverse engineering is then reversing the process of engineering, recovering the semantic artifact from the engineering artifact. That the OP is using the term in the sense of recovering the semantic insights from the CUDA kernels is a fine application of the concept.
> Unless you've actually watched tokens stream at those rates, the numbers are hard to internalize. This is the rendering.
I built something similar recently, for the same reason: https://modal.com/llm-almanac/token-timing-simulator.
I like that the output rendering is closer to typical UIs -- syntax highlighting in code mode, tool calls, dim-italic reasoning.
One feature mine has that the author, or anyone else who vibe codes their own version after seeing this, might like to steal is modeling the distribution of output latencies. My implementation is hacky (a log-normal roughly estimated from p50, p90, and p99 values), but still, when you set those to realistic values, it recreates the "jitter" you see in many LLM UIs.
antirez is right that generation tok/s isn't flat as a function of context length, which is a weakness of both simulators.
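For anyone who wants to steal the latency idea, here is a rough Python sketch of one way to do it. This is not the simulator's actual code: the function names and the choice to average the two sigma estimates are assumptions for illustration. It takes the log-normal's mu from the median and estimates sigma from how far p90 and p99 sit above the median in log space.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal(p50: float, p90: float, p99: float) -> tuple[float, float]:
    """Rough log-normal fit: mu from the median, sigma averaged over p90 and p99."""
    z90 = NormalDist().inv_cdf(0.90)  # standard normal quantile, about 1.2816
    z99 = NormalDist().inv_cdf(0.99)  # standard normal quantile, about 2.3263
    mu = math.log(p50)
    sigma_from_p90 = (math.log(p90) - mu) / z90
    sigma_from_p99 = (math.log(p99) - mu) / z99
    return mu, 0.5 * (sigma_from_p90 + sigma_from_p99)


def sample_latencies_ms(mu: float, sigma: float, n: int, seed: int = 0) -> list[float]:
    """Draw n inter-token gaps from the fitted log-normal."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


# Hypothetical percentiles in milliseconds, not measurements of any real model.
mu, sigma = fit_lognormal(p50=20.0, p90=40.0, p99=80.0)
gaps = sample_latencies_ms(mu, sigma, n=5)
```

Sleeping for each sampled gap between tokens, instead of a fixed 1/rate, is what produces the jitter. It does not address the context-length effect mentioned above.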