Features:
- Single file, no dependencies
- GGUF format parser
- Llama 3 tokenizer
- Support for Llama 3, 3.1 (ad-hoc RoPE scaling) and 3.2 (tied word embeddings)
- Fast matrix-vector multiplication routines for Q4_0 and Q8_0 quantized tensors using Java's Vector API
- GraalVM Native Image support
- AOT model preloading for instant time-to-first-token
Llama3.java: GGUF file format support, Q8_0 and Q4_0 quantizations, and fast matrix-vector multiplication routines using Java's Vector API, served by a simple CLI with a --chat mode to interact with the Llama 3 models.
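To make the GGUF-parsing feature concrete, here is a minimal sketch of reading the fixed GGUF header (magic, version, tensor count, metadata key/value count). This is not the project's actual parser, just an illustration of the format's little-endian header layout; the class and field names are mine.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hedged sketch: parse the fixed-size GGUF header from a little-endian buffer.
// The real parser continues with metadata key/value pairs and tensor infos.
public final class GgufHeader {
    static final int GGUF_MAGIC = 0x46554747; // the bytes "GGUF", little-endian

    final int version;
    final long tensorCount;
    final long metadataKvCount;

    GgufHeader(int version, long tensorCount, long metadataKvCount) {
        this.version = version;
        this.tensorCount = tensorCount;
        this.metadataKvCount = metadataKvCount;
    }

    static GgufHeader parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        int magic = buf.getInt();
        if (magic != GGUF_MAGIC) {
            throw new IllegalArgumentException(
                "Not a GGUF file: magic=0x" + Integer.toHexString(magic));
        }
        return new GgufHeader(buf.getInt(), buf.getLong(), buf.getLong());
    }
}
```

The header is all that is needed to decide whether a file is worth parsing further; everything after it (metadata, tensor descriptors, tensor data) is variable-length.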
This will boost adoption at so many levels:
- Importing a Truffle language as a regular Maven dependency
- Ease integration with mainstream package managers
- Ability to update to the latest language version, independently of the JVM used
I'm still in awe at how smooth the Truffle "unchaining" worked out with no API changes (just a few necessary additions).
Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API and I got a ~4X speedup with a single thread, but the speedup fades the more threads you add. I think that performance is constrained by memory bandwidth which is saturated with a small number of threads, regardless of vectorization.
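For readers unfamiliar with the Vector API, the kind of kernel being discussed looks roughly like the sketch below: an explicitly vectorized float dot product with a lane-wise FMA accumulator and a scalar tail. This is an illustration, not the project's actual Q4_0/Q8_0 routines (those operate on quantized blocks); the class name is mine.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hedged sketch: explicit SIMD lanes via the Vector API instead of hoping
// the JIT auto-vectorizes a scalar loop.
public final class VectorDot {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) { // scalar tail for the remainder
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Compiling and running this requires `--add-modules jdk.incubator.vector`, since the API is still incubating. Note that a kernel like this speeds up the compute side only; as the comment above says, once memory bandwidth is saturated, adding threads (or wider lanes) stops helping.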
Hey man, awesome stuff. Surely any JIT compiler will struggle to vectorize something using IntStream.range, though? Looking at matmul, I'd not expect that to be auto-vectorized. The Panama Vector API could be used to vectorize the matmul; too bad it seems to never leave incubation.
GraalVM team member here.
Implementing any mainstream language is indeed a challenge, more so if you have to maintain bug-compatibility and cope with all the bits of bad design that slipped through the cracks in the de-facto implementation.
Truffle is not for beginners, but knowing the basic set of features (e.g. partial evaluation, deoptimization) can already get you very far; for example, you can easily speed up an interpreter by 10X or more with minimal changes.
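To illustrate what partial evaluation buys you, consider a plain tree-walking interpreter like the sketch below (plain Java, no Truffle dependency; all names are mine). Every node pays a virtual call on each execution. Truffle's partial evaluator inlines execute() through the whole tree for a fixed program, so Graal can compile the tree down to straight-line code, which is where the large interpreter speedups come from.

```java
// Hedged illustration of the interpreter shape Truffle optimizes:
// a tree of nodes, each with a virtual execute() method.
interface Node {
    long execute(long[] args);
}

// Reads one argument of the interpreted "function".
final class Arg implements Node {
    final int index;
    Arg(int index) { this.index = index; }
    public long execute(long[] args) { return args[index]; }
}

// Adds the results of two child nodes; dispatch happens on every call.
final class Add implements Node {
    final Node left, right;
    Add(Node left, Node right) { this.left = left; this.right = right; }
    public long execute(long[] args) {
        return left.execute(args) + right.execute(args);
    }
}
```

For a fixed tree such as `new Add(new Arg(0), new Add(new Arg(1), new Arg(1)))`, partial evaluation specializes the whole `execute` chain to the equivalent of `args[0] + args[1] + args[1]`.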
How long does it take to implement a programming language? Well, from hours to years, depending on the language.
To make my point: how long would it take to implement a JVM? A JVM is a complex beast, so I would guess years, probably up to a decade. What if I told you that Espresso was written in just 6 months by an intern and a seasoned engineer, and in those 6 months it became able to run Minecraft and even run itself?
I assure you there's no magic here, and certainly no blinding talent either; the only reason for this unheard-of productivity was Graal/Truffle.
So, whenever I talk about Espresso I always give all credit to Graal/Truffle, it is a sublime platform for implementing fast languages and runtimes, of which Espresso is just a byproduct.
Just a tiny side note: from personal experience, a basic toy JVM is actually not that hard (no JIT, a trivial GC, a limited standard library). Of course, a performant one with feature parity is indeed impressive (though I have yet to play with Espresso!).
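The "toy JVM" claim above is easy to believe once you see that the core of a bytecode interpreter is just an operand stack and a switch over opcodes. A minimal sketch (real JVM opcode values, but everything else, including the class name, is mine; classfile parsing, GC, JIT and the standard library are where the real work lives):

```java
// Hedged sketch of a toy bytecode interpreter: an operand stack plus a
// dispatch loop over a handful of real JVM opcodes.
public final class ToyInterpreter {
    static final int BIPUSH = 0x10, IADD = 0x60, IMUL = 0x68, IRETURN = 0xAC;

    static int run(byte[] code) {
        int[] stack = new int[16]; // fixed-size operand stack, toy-sized
        int sp = 0, pc = 0;
        while (true) {
            int op = code[pc++] & 0xFF;
            switch (op) {
                case BIPUSH -> stack[sp++] = code[pc++];           // push byte
                case IADD   -> { sp--; stack[sp - 1] += stack[sp]; } // a + b
                case IMUL   -> { sp--; stack[sp - 1] *= stack[sp]; } // a * b
                case IRETURN -> { return stack[--sp]; }
                default -> throw new IllegalStateException("Unknown opcode: " + op);
            }
        }
    }
}
```

For example, the bytecode sequence `bipush 2; bipush 3; iadd; bipush 4; imul; ireturn` evaluates (2 + 3) * 4. Scaling this up to the full ~200-opcode instruction set is tedious rather than hard; performance and completeness are the actual challenge.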
No Python. No JNI. No native code. Just Java.
It’s (mostly) a single Java file implementing the full stack:
GGUF parsing, tokenization, Gemma 4 transformer inference, quantizations, CLI...
Built using the Java Vector API, with support for GraalVM Native Image.