Features:
- Single file, no dependencies
- GGUF format parser
- Llama 3 tokenizer
- Support for Llama 3, 3.1 (ad-hoc RoPE scaling) and 3.2 (tied word embeddings)
- Fast matrix-vector multiplication routines for Q4_0 and Q8_0 quantized tensors using Java's Vector API
- GraalVM Native Image support
- AOT model preloading for instant time-to-first-token
Llama3.java: GGUF file format support, Q8_0 and Q4_0 quantizations, and fast matrix-vector multiplication routines using Java's Vector API, served by a simple CLI with a --chat mode to interact with the Llama 3 models.
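To make the GGUF-parsing feature concrete, here is a minimal sketch of reading the fixed GGUF header (magic, version, tensor count, metadata key/value count). This is not the project's actual parser, just an illustration of the format's little-endian header layout; the class and field names are mine.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hedged sketch: parse the fixed-size GGUF header from a little-endian buffer.
// The real parser continues with metadata key/value pairs and tensor infos.
public final class GgufHeader {
    static final int GGUF_MAGIC = 0x46554747; // the bytes "GGUF", little-endian

    final int version;
    final long tensorCount;
    final long metadataKvCount;

    GgufHeader(int version, long tensorCount, long metadataKvCount) {
        this.version = version;
        this.tensorCount = tensorCount;
        this.metadataKvCount = metadataKvCount;
    }

    static GgufHeader parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        int magic = buf.getInt();
        if (magic != GGUF_MAGIC) {
            throw new IllegalArgumentException(
                "Not a GGUF file: magic=0x" + Integer.toHexString(magic));
        }
        return new GgufHeader(buf.getInt(), buf.getLong(), buf.getLong());
    }
}
```

The header is all that is needed to decide whether a file is worth parsing further; everything after it (metadata, tensor descriptors, tensor data) is variable-length.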
This will boost adoption at so many levels:
- Importing a Truffle language as a regular Maven dependency
- Ease integration with mainstream package managers
- Ability to update to the latest language version, independently of the JVM used
I'm still in awe at how smooth the Truffle "unchaining" worked out with no API changes (just a few necessary additions).
Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API and I got a ~4X speedup with a single thread, but the speedup fades the more threads you add. I think that performance is constrained by memory bandwidth which is saturated with a small number of threads, regardless of vectorization.
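For readers unfamiliar with the Vector API, the kind of kernel being discussed looks roughly like the sketch below: an explicitly vectorized float dot product with a lane-wise FMA accumulator and a scalar tail. This is an illustration, not the project's actual Q4_0/Q8_0 routines (those operate on quantized blocks); the class name is mine.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hedged sketch: explicit SIMD lanes via the Vector API instead of hoping
// the JIT auto-vectorizes a scalar loop.
public final class VectorDot {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) { // scalar tail for the remainder
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Compiling and running this requires `--add-modules jdk.incubator.vector`, since the API is still incubating. Note that a kernel like this speeds up the compute side only; as the comment above says, once memory bandwidth is saturated, adding threads (or wider lanes) stops helping.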
Hey man, awesome stuff. Surely any JIT compiler will struggle to vectorize something using IntStream.range, though? Looking at matmul, I'd not expect that to be auto-vectorized. The Panama Vector API could be used to vectorize the matmul; too bad it seems to never leave incubation.
GraalVM team member here.
Implementing any mainstream language is indeed a challenge, more so if you have to maintain bug-compatibility and cope with all the bits of bad design that slipped through the cracks in the de-facto implementation.
Truffle is not for beginners, but knowing the basic set of features (e.g. partial evaluation, deoptimization) can already get you very far; for example, you can easily speed up an interpreter by 10X or more with minimal changes.
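To illustrate what partial evaluation buys you, consider a plain tree-walking interpreter like the sketch below (plain Java, no Truffle dependency; all names are mine). Every node pays a virtual call on each execution. Truffle's partial evaluator inlines execute() through the whole tree for a fixed program, so Graal can compile the tree down to straight-line code, which is where the large interpreter speedups come from.

```java
// Hedged illustration of the interpreter shape Truffle optimizes:
// a tree of nodes, each with a virtual execute() method.
interface Node {
    long execute(long[] args);
}

// Reads one argument of the interpreted "function".
final class Arg implements Node {
    final int index;
    Arg(int index) { this.index = index; }
    public long execute(long[] args) { return args[index]; }
}

// Adds the results of two child nodes; dispatch happens on every call.
final class Add implements Node {
    final Node left, right;
    Add(Node left, Node right) { this.left = left; this.right = right; }
    public long execute(long[] args) {
        return left.execute(args) + right.execute(args);
    }
}
```

For a fixed tree such as `new Add(new Arg(0), new Add(new Arg(1), new Arg(1)))`, partial evaluation specializes the whole `execute` chain to the equivalent of `args[0] + args[1] + args[1]`.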
How long does it take to implement a programming language? Well, from hours to years, depending on the language.
To make my point: how long would it take to implement a JVM? A JVM is a complex beast, so I would guess years, probably up to a decade. What if I told you that Espresso was written in just 6 months by an intern and a seasoned engineer, and in those 6 months it became able to run Minecraft and even run itself?
I assure you there's no magic here, and certainly no blinding talent either; the only reason for this unheard-of productivity was Graal/Truffle.
So, whenever I talk about Espresso I always give all credit to Graal/Truffle, it is a sublime platform for implementing fast languages and runtimes, of which Espresso is just a byproduct.
Just a tiny side note: from personal experience, a basic toy JVM is actually not that hard (no JIT, a trivial GC, a limited standard library). Of course, a performant one with feature parity is indeed impressive (though I have yet to play with Espresso!).
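The "toy JVM" claim above is easy to believe once you see that the core of a bytecode interpreter is just an operand stack and a switch over opcodes. A minimal sketch (real JVM opcode values, but everything else, including the class name, is mine; classfile parsing, GC, JIT and the standard library are where the real work lives):

```java
// Hedged sketch of a toy bytecode interpreter: an operand stack plus a
// dispatch loop over a handful of real JVM opcodes.
public final class ToyInterpreter {
    static final int BIPUSH = 0x10, IADD = 0x60, IMUL = 0x68, IRETURN = 0xAC;

    static int run(byte[] code) {
        int[] stack = new int[16]; // fixed-size operand stack, toy-sized
        int sp = 0, pc = 0;
        while (true) {
            int op = code[pc++] & 0xFF;
            switch (op) {
                case BIPUSH -> stack[sp++] = code[pc++];           // push byte
                case IADD   -> { sp--; stack[sp - 1] += stack[sp]; } // a + b
                case IMUL   -> { sp--; stack[sp - 1] *= stack[sp]; } // a * b
                case IRETURN -> { return stack[--sp]; }
                default -> throw new IllegalStateException("Unknown opcode: " + op);
            }
        }
    }
}
```

For example, the bytecode sequence `bipush 2; bipush 3; iadd; bipush 4; imul; ireturn` evaluates (2 + 3) * 4. Scaling this up to the full ~200-opcode instruction set is tedious rather than hard; performance and completeness are the actual challenge.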
No Python. No JNI. No native code. Just Java.
It’s (mostly) a single Java file implementing the full stack:
GGUF parsing, tokenization, Gemma 4 transformer inference, quantizations, CLI...
Built using the Java Vector API, with support for GraalVM Native Image.