Hacker News | charles_irl's comments

Very cool!

> Unless you've actually watched tokens stream at those rates, the numbers are hard to internalize. This is the rendering.

I built something similar recently, for the same reason: https://modal.com/llm-almanac/token-timing-simulator.

I like that the output rendering is closer to typical UIs -- syntax highlighting in code mode, tool calls, dim-italic reasoning.

One feature mine has that the author, or anyone else who vibe codes their own version after seeing this, might like to steal is modeling the distribution of output latencies. My implementation is hacky (log-normal roughly estimated from p50, p90, and p99 values), but still, when you set those to realistic values, it recreates the "jitter" you see in many LLM UIs.
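For anyone who wants to try the percentile trick, here is a minimal sketch of one way to do it. It is not the simulator's actual code: the function names, the least-squares fit, and the example percentile values are all assumptions made for illustration. It relies on the fact that for a log-normal, ln(quantile_q) = mu + sigma * z_q, so three percentiles give three points on a line.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal(p50, p90, p99):
    """Least-squares fit of (mu, sigma) so that ln(latency) ~ Normal(mu, sigma).

    Each percentile gives one point (z_q, ln p_q) on the line
    ln p_q = mu + sigma * z_q, where z_q is the standard normal quantile.
    """
    zs = [NormalDist().inv_cdf(q) for q in (0.50, 0.90, 0.99)]
    ys = [math.log(p) for p in (p50, p90, p99)]
    z_mean = sum(zs) / len(zs)
    y_mean = sum(ys) / len(ys)
    sigma = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys)) / sum(
        (z - z_mean) ** 2 for z in zs
    )
    mu = y_mean - sigma * z_mean
    return mu, sigma


def sample_gaps(mu, sigma, n, seed=0):
    """Draw n inter-token gaps (same unit as the percentiles) from the fit."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


# Hypothetical percentiles in milliseconds, chosen only for illustration.
mu, sigma = fit_lognormal(p50=20.0, p90=45.0, p99=120.0)
gaps = sample_gaps(mu, sigma, n=1000)
```

Sleeping for each sampled gap between tokens, instead of a constant 1 / (tok/s), is what produces the uneven cadence.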

antirez is right that generation tok/s isn't flat as a function of context length, which is a weakness of both simulators.


To clarify: we do content-based hashing, and when we say "shared bytes aren’t guaranteed to be in the exact same container image layer", what we mean is that

  FROM some/image
  RUN pip install torch==2.7.1

and

  FROM another/image
  RUN pip install torch==2.7.1

will produce images with very high overlap in contents, which will be shared by a content-based cache, but those images' final layers are disjoint from the perspective of a layerwise cache.
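A toy model of the difference, in case it helps. This is only a sketch under stated assumptions: the file names, the stand-in file contents, and the way the layer key is computed (parent key plus the whole diff) are invented for illustration and are not how any particular registry or our cache is implemented.

```python
import hashlib


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def layer_key(parent_key: str, files: dict[str, bytes]) -> str:
    """Toy layerwise cache key: depends on the parent layer and the whole diff."""
    h = hashlib.sha256(parent_key.encode())
    for path in sorted(files):
        h.update(path.encode())
        h.update(files[path])
    return h.hexdigest()


def content_keys(files: dict[str, bytes]) -> set[str]:
    """Toy content-based cache keys: one per file, independent of any layer."""
    return {sha256(data) for data in files.values()}


# Stand-in file contents (hypothetical); real wheels are much larger.
torch_files = {
    "site-packages/torch/__init__.py": b"torch init",
    "site-packages/torch/lib/libtorch.so": b"torch shared library bytes",
}

# The same `pip install` run on top of two different base images.
layer_a = {**torch_files, "root/.cache/pip/log": b"built on some/image"}
layer_b = {**torch_files, "root/.cache/pip/log": b"built on another/image"}

key_a = layer_key(sha256(b"some/image"), layer_a)
key_b = layer_key(sha256(b"another/image"), layer_b)
shared = content_keys(layer_a) & content_keys(layer_b)

print(key_a == key_b)  # layerwise view: whether the two layers can be reused
print(len(shared))     # content view: number of files with identical bytes
```

In this toy setup the two layer keys differ, so a layerwise cache reuses nothing, while the content view still finds both torch files.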


Thank you

Yep! That should start in ten seconds or so -- about a second per gigabyte of weights, plus a second to start the container and a few seconds to load the memory snapshot.

There are a few limitations with snapshotting, e.g. it generally fails when using multiple GPUs, which we document here: https://modal.com/docs/guide/memory-snapshots.


Cutting latencies by 40x! Unfortunately couldn't fit the whole title in the character limit :<

How can you cut latency by more than 1x? I am not intending to be snarky, it just doesn’t fit my brain how you can reduce a measured time by more than the original starting time.

Put differently, 1/40 is not the same as 1x - 40x. I’d phrase it as "reduced by 97.5%" or 0.975x.

There are two ways to express such ratios, and both are equally valid. (Though "x" is usually reserved for "40x" and "%" for "97.5%".)
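The arithmetic connecting the two conventions, as a small sketch (the function names are made up for illustration):

```python
def reduction_fraction(speedup: float) -> float:
    """Fraction of the original latency removed by a given speedup ratio."""
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(fraction: float) -> float:
    """Speedup ratio implied by removing `fraction` of the original latency."""
    return 1.0 / (1.0 - fraction)


# A 40x speedup leaves 1/40 = 2.5% of the original latency,
# which is the same as a reduction of about 97.5%.
print(f"{reduction_fraction(40.0):.1%}")
print(f"{speedup_from_reduction(0.975):.0f}x")
```

So "40x" and "97.5%" describe the same change; the first is the ratio old/new, the second is (old - new)/old.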

probably just AI slop and using wrong semantics, they mean speedup ratio.

You're absolutely right!

> Ed Zitron also called out the business model of GPU-as-a-service middleman companies like modal deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds.

You got a link for that? I work on Modal and would be interested in seeing the argument!

We think building a proper software layer for multitenant demand aggregation on top of the public clouds is sufficient value-add to be a sustainable business (cf DBRX and Snowflake).


Snowflake and Databricks provide data storage and pipeline features and therefore have extraordinary lock-in potential, which allows them to have sustainable business models.

GPU compute is essentially fungible. That's quite a stretch to compare those business models. Snowflake and Databricks don't necessarily have the best "value-add" and they don't need to.


It was on his last newsletter, but I can't link it right now.


Sorry to lead with a bunch of jargon! Wanted to make it obvious that we'd give concrete recommendations instead of palaver.

The technical terms there are later explained and diagrammed, and the recommendations derived from something close to first principles (e.g. roofline analysis).


oof ty, willfix


Inspired by https://float.exposed, which was on the front page recently, I put together this visualizer for lower precision/quantized floating point numbers -- specifically, all of the formats in the Open Compute Project's Microscaling Formats Spec (https://www.opencompute.org/documents/ocp-microscaling-forma...).


Thanks! I think computers are fun and I want reading about them to be fun too.

I was also reminded of HazyResearch's MegaKernels. Didn't want to distract from the main thrust of the post, but definitely think that's a promising approach.


There's some interesting work in NeurIPS this year on fused kernels for MoE too: https://flash-moe.github.io/


Hey, one of the authors here!

Reductively, software engineering means taking an idea and mapping it into code. So one form of "reverse" engineering would be taking the code and extracting the ideas. That's what we did here.

Because the source is public, there's quite a lot to work with from the start -- the warp specializations are named and there are helpful comments in many places.

But for many components, we didn't have much. Maybe the clearest case of "reverse engineering" explained in the post is with the cubic approximation for the rational part of the exponentiation. That required staring at some inline assembly and doing math.


I've never heard of this definition of reverse engineering -- when one has the unobfuscated actual source code I'd usually call it: reading the code, or something like summarization.

Not trying to be uncharitable, I found your article informative. Reverse engineering has historically been reserved for cases where there is an adversarial aspect, as in binaries or server APIs. Anyhow, Cheers and thank you, sincerely.


That is the traditional explanation of why it is called reverse engineering. The term originated in hardware engineering. When it was originally applied to software, it was common to create requirements documents and design documents before coding, even if the actual process did not strictly follow the "waterfall" idea.

Thus it was natural to call the process of producing design documents from undocumented software "reverse engineering". These days coding without any formal design documents is so common that it seems the original meaning of reverse engineering has become obscured.


In what time period and area did you come across this usage? As far as I ever saw it used, 'reverse engineering' generally referred to creating docs from executables or watching network protocols rather than from source.


Back in the 1990s. As an example, back then the Rational Rose design software had a feature to generate UML diagrams from existing source code, and it was called "reverse engineering".

https://en.wikipedia.org/wiki/IBM_Rational_Rose


Having the source code and understanding how it works are two different things, especially when running on state of the art hardware. If I had just read the source I would not have gained as much knowledge as this article taught me. Where did this extra info come from? They read the source too, but then they did something more. I wouldn’t call it summarization either, as again any summary I wrote about the code would pale in comparison.


I think "explained" is a reasonable term for this. If I remember correctly, there were books of the form "The Linux Source Code Explained".

Certainly I can't get on board with reverse engineered.


That time when I reverse engineered JRR Tolkien’s Lord of the Rings from symbols engraved on dead trees. Took me three summers…


it’s more properly just software archaeology; recovering design intent from artifacts https://en.m.wikipedia.org/wiki/Software_archaeology


You've never had to reverse engineer the thinking and ideas that went behind code written by someone else/you a year ago?


No, because so far you "engineered" nothing. You just studied it, tried to understand it, and explain or teach it.

If you had reverse engineered it, you would have tried to "recreate something" that does not exist to do the same.

So, if you have a binary code, you recreate the source code that in theory could allow you to recreate the binary.

If you have the source code, I guess that would be when you are missing pieces of info that allows you to run this code like it is done by others...


Disagree that reverse engineering necessarily requires something to be recreated.

For example, simple hardware reversing can just be learning what, how and why something works, you don't need to "recreate" anything other than ideas.


You guys are being obtuse. Engineering is turning a spec into a more technical artifact, whether that's source code, machine code, physical hardware or something else. Reverse engineering is then reversing the process of engineering, recovering the semantic artifact from the engineering artifact. That the OP is using the term in the sense of recovering the semantic insights from the CUDA kernels is a fine application of the concept.


I have to say this is kind of funny given that you also had this in the blog post:

> cudnn kernels are closed source, so Jensen only knows what’s going on in there.


It's the 'hacker' argument all over again.


I reverse engineered above comment by reading it and extracting the idea.

