The discrete wavelet transform (DWT) compresses an image by repeatedly downscaling it, and storing the information which was lost during downscaling. Here's an image which has been downscaled twice, with its difference images (residuals): https://commons.wikimedia.org/wiki/File:Jpeg2000_2-level_wav.... To decompress that image, you essentially just 2x-upscale it, and then use the residuals to restore its fine details.
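The downscale-plus-residuals round trip can be sketched in a few lines. This is a toy Haar-style scheme of my own for illustration, not the 5/3 or 9/7 lifting filters JPEG 2000 actually uses:

```python
import numpy as np

def decompose(img):
    """Split an even-sized 2D image into a half-size average and residuals."""
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    low = (a + b + c + d) / 4.0              # the 2x-downscaled image
    residuals = (a - low, b - low, c - low)  # d is implied: d = 4*low - a - b - c
    return low, residuals

def reconstruct(low, residuals):
    """Invert decompose(): 2x-upscale and re-apply the stored detail."""
    ra, rb, rc = residuals
    out = np.empty((low.shape[0] * 2, low.shape[1] * 2))
    out[0::2, 0::2] = low + ra
    out[0::2, 1::2] = low + rb
    out[1::2, 0::2] = low + rc
    out[1::2, 1::2] = 4 * low - (low + ra) - (low + rb) - (low + rc)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
low, res = decompose(img)
assert np.allclose(reconstruct(low, res), img)  # lossless round trip
```

Lossy compression then comes from quantising the residuals: small detail coefficients cost few bits, and dropping them entirely gives you exactly the "upscaled low-resolution image" failure mode.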
Wavelet compression is better than the block-based DCT for preserving sharp edges and gradients, but worse for preserving fine texture (noise). The DCT can emulate noise by storing just a couple of high-frequency coefficients for a 64-pixel block, but the DWT would need to store dozens of coefficients to achieve noise synthesis of similar quality.
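To make the "couple of coefficients" claim concrete, here's a small numpy sketch (the coefficient positions and amplitudes are arbitrary, chosen for illustration) that fills an 8x8 block with noise-like texture from just two high-frequency DCT basis functions:

```python
import numpy as np

def dct_basis(u, v, n=8):
    """The (u, v) 2D DCT-II basis pattern for an n x n block."""
    x = np.arange(n)
    return np.outer(np.cos((2 * x + 1) * u * np.pi / (2 * n)),
                    np.cos((2 * x + 1) * v * np.pi / (2 * n)))

# Two arbitrary high-frequency coefficients are enough to cover the
# whole 64-pixel block with dense, noise-like oscillation.
block = 30.0 * dct_basis(7, 5) - 25.0 * dct_basis(6, 7)
print(block.round(1))
```

A wavelet transform localises energy in space as well as frequency, so representing the same block-wide oscillation costs it many small detail coefficients instead of two large ones.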
The end result is that JPEG and JPEG 2000 achieve roughly the same lossy compression ratio before image artefacts show up. JPEG blurs edges, JPEG 2000 blurs texture. At very low bitrates, JPEG becomes blocky, and JPEG 2000 looks like a low-resolution image which has been upscaled (because it's hardly storing any residuals at all!)
FFmpeg has a `jpeg2000` codec; if you're interested in image compression, running a manual comparison between JPEG and JPEG 2000 is a worthwhile way to spend an hour or two.
I found a jpeg2000 reference PDF somewhere. It may as well have been written in Mandarin.
I got as far as extracting the width and height. It's much more advanced than jpeg. Forget about writing a decoder.
Both formats are DCT-based (except for lossless JPEG XL). JPEG 2000's use of the DWT was unusual; in general, still-image lossy compression research has spent the last 35 years iteratively improving on JPEG's design. This is partly for compatibility reasons, but it's also because the original design was very good.
Since JPEG, improvements have included better lossless compression (entropy coding) of the DCT coefficients; deblocking filters, which blur the image across block boundaries; predicting the contents of DCT blocks from their neighbours, especially prediction of sharp edges; variable DCT block sizes, rather than a fixed 8x8 grid; the ability to compress some DCT blocks more aggressively than others within the same image; encoding colour channels together, rather than splitting them into three completely separate images; and the option to synthesise fake noise in the decoder, since real noise can't be compressed.
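To illustrate one item from that list, here's a toy deblocking filter (my own simplification with a made-up threshold, not the H.264/HEVC filter): it softens small steps across block boundaries, on the theory that a small step there is probably a quantisation artefact while a large step is a real edge.

```python
import numpy as np

def deblock_rows(img, block=8, threshold=20.0):
    """Toy horizontal deblocking: soften small steps across vertical
    block boundaries, leaving large (probably real) edges alone."""
    out = img.astype(float).copy()
    for x in range(block, out.shape[1], block):
        step = out[:, x] - out[:, x - 1]
        weak = np.abs(step) < threshold   # small steps look like artefacts
        out[weak, x - 1] += step[weak] / 4
        out[weak, x] -= step[weak] / 4
    return out

# A faint 10-level step at the 8-pixel boundary gets smoothed...
img = np.zeros((2, 16))
img[:, 8:] = 10.0
smoothed = deblock_rows(img)
# ...while a strong 50-level step (a real edge) is left untouched.
edge = np.zeros((2, 16))
edge[:, 8:] = 50.0
```

Real deblocking filters condition on quantiser strength and filter several pixels on each side of the boundary, but the artefact-versus-edge decision is the same idea.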
You might be interested in this paper: https://arxiv.org/pdf/2506.05987. It's a very approachable summary of JPEG XL, which is roughly the state of the art in still-image compression.
Thanks. The paper is fascinating. I've only skimmed it so far, but it's full of interesting details, even beyond compression. They really tried hard to make the USB of image formats by supporting as many features and use cases as possible, even things like multiple layers and non-destructive cropping. I like the section where they talk about previous image formats, why many of them failed, and how they tried to learn from past mistakes.
Regarding algorithms: searching for "learned image compression" turns up a lot of research papers that use neural networks rather than analytic transforms like the DCT. The compression ratios already seem to outperform conventional codecs. I guess the bottleneck is slow decoding speed rather than compression ratio; at least, that's the issue with neural video compression.
As I understand it, very small neural networks have already been incorporated into both VVC and AV2 for intra prediction. You're correct that this strategy is limited by decoding performance, especially when predicting large blocks.
In general, I'm pessimistic about prediction-and-residuals strategies for lossy compression. They tend to amplify noise; they create data dependencies, which interfere with parallel decoding; they require non-local optimisation in the encoder; really good prediction involves expensive analysis of a large number of decoded pixels; and it all feels theoretically unsound (because predictors usually produce just one value, rather than a probability distribution).
I'm more optimistic about lossy image codecs based on explicitly-coded summary statistics, with very little prediction. That approach worked well for lossy JPEG XL.
Everything after JPEG is still fundamentally the same, but individual parts of the algorithm are supercharged.
JPEG has 8x8 blocks, modern codecs have variable-sized blocks from 4x4 to 128x128.
JPEG has RLE+Huffman, modern codecs have context-adaptive variations of arithmetic coding.
JPEG has a single quality scale for the whole image, modern codecs allow quality to be tweaked in different areas of the image.
JPEG applies block coefficients on top of a single flat colour per block (the DC coefficient), modern codecs start from a "prediction" made by smearing the previous couple of blocks into the current one.
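A toy version of that neighbour smearing, loosely modelled on H.264's DC and vertical intra modes (the pixel values and 4x4 block size are made up for illustration):

```python
import numpy as np

def predict_dc(above, left):
    """DC mode: predict the whole block as the mean of its neighbours."""
    n = len(above)
    return np.full((n, n), (above.sum() + left.sum()) / (2 * n))

def predict_vertical(above, left):
    """Vertical mode: smear the row above straight down through the block."""
    return np.tile(above, (len(above), 1))

# Hypothetical already-decoded neighbour pixels containing a sharp edge.
above = np.array([10.0, 10.0, 80.0, 80.0])
left = np.array([10.0, 10.0, 10.0, 10.0])

# If the real block continues the edge downward, vertical prediction is
# perfect and the residual (all that's left to transform and code) is zero.
block = np.tile(above, (4, 1))
residual = block - predict_vertical(above, left)
```

The encoder tries each mode, picks whichever leaves the cheapest residual, and signals the chosen mode alongside the coded coefficients.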