Lightning-fast lossless (image) compression

This post presumes some knowledge of (image) compression, but offers intriguing performance numbers: lossless (de)compression at data rates exceeding 1 GB/s.

There seems to be considerable interest in fast compression, as evidenced by the number of comments on a new algorithm: http://kdepepo.wordpress.com/2012/01/30/fast-lossless-color-image-compression/

Certainly data sizes and sensor resolutions have ballooned. I am aware of two operational Gigapixel-scale sensors:

The amount of data they produce is just phenomenal, but compression is useful for more mainstream applications as well. Some of the last software I developed at Fraunhofer (together with Dominik Perpeet) was an image viewer that could smoothly zoom within terapixel images: http://www.amostech.com/TechnicalPapers/2011/Poster/PERPEET.pdf
That made for a nice platform for the thesis defense – rendering the slides into a 1 million by 1 million pixel image, showing them individually, and then zooming out at the end.

That image had to fit on the demo laptop, so there was some fairly simple but effective compression (“LASC”) for four-channel, 16-bit data: http://booksc.org/dl/11938833/073a63
It basically did a brute-force search for similar 1-D runs (instead of the usual 2-D blocks), followed by 2x or 4x packing of the residuals. Surprisingly, this came within about 50% of JPEG-2000's compression ratio while exceeding its speed by a factor of 100.
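
To make the packing step concrete, here is a rough sketch of what 2x/4x residual packing could look like. This is not the actual LASC code; the run layout, mode tags and function names are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: pack a run of 16-bit residuals into 4-bit (4x) or 8-bit (2x)
// slots when every residual in the run is small enough, otherwise store them
// verbatim. Mode tags and byte layout are assumptions.
std::vector<uint8_t> PackResiduals(const int16_t* residuals, size_t count) {
  int max_abs = 0;
  for (size_t i = 0; i < count; ++i) {
    const int a = residuals[i] < 0 ? -residuals[i] : residuals[i];
    if (a > max_abs) max_abs = a;
  }

  std::vector<uint8_t> out;
  if (max_abs < 8) {            // fits in 4 bits (two's complement) -> 4x packing
    out.push_back(0x02);        // mode tag: 4x
    for (size_t i = 0; i < count; i += 2) {
      const uint8_t lo = static_cast<uint8_t>(residuals[i] & 0x0F);
      const uint8_t hi = (i + 1 < count)
          ? static_cast<uint8_t>((residuals[i + 1] & 0x0F) << 4) : 0;
      out.push_back(lo | hi);
    }
  } else if (max_abs < 128) {   // fits in 8 bits -> 2x packing
    out.push_back(0x01);        // mode tag: 2x
    for (size_t i = 0; i < count; ++i)
      out.push_back(static_cast<uint8_t>(residuals[i] & 0xFF));
  } else {                      // no packing possible: store verbatim
    out.push_back(0x00);
    for (size_t i = 0; i < count; ++i) {
      const uint16_t u = static_cast<uint16_t>(residuals[i]);
      out.push_back(static_cast<uint8_t>(u & 0xFF));
      out.push_back(static_cast<uint8_t>(u >> 8));
    }
  }
  return out;
}
```

On decode, the mode tag selects the slot width and the residuals are sign-extended back to 16 bits. Note how one large residual forces the whole run into a wider mode, which is exactly the weakness discussed below.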

Unfortunately it was not effective for 8-bit data.  Over the past few months, I’ve been following up on several crazy ideas for SIMD entropy coding, with the result that luminance images from http://www.imagecompression.info/test_images can be compressed by 2.9x at a throughput of 1.3 GPixel/s on my W3550 CPU.

Let’s first look at the predictor. LASC’s brute-force search is too slow – applications such as real-time video compression require symmetric algorithms that compress roughly as fast as they decompress. Following the approach pioneered by History Based Blending (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.5981), three sub-predictors on the one-ring of causal neighbors turn out to be sufficient.
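
As an illustration of what such sub-predictors might look like, the sketch below uses three classic causal candidates (west, north, and a clamped gradient). The way History Based Blending actually weights them is only crudely approximated here, and all names are mine.

```cpp
#include <algorithm>
#include <cstdint>

// Causal neighbors of pixel (x, y): W = left, N = above, NW = above-left.
struct CausalNeighbors {
  int32_t w, n, nw;
};

inline int32_t PredictWest(const CausalNeighbors& c)  { return c.w; }
inline int32_t PredictNorth(const CausalNeighbors& c) { return c.n; }
inline int32_t PredictGradient(const CausalNeighbors& c) {
  // Gradient predictor W + N - NW, clamped to the range spanned by W and N
  // (the same idea as the MED/LOCO-I predictor).
  const int32_t lo = std::min(c.w, c.n);
  const int32_t hi = std::max(c.w, c.n);
  return std::max(lo, std::min(hi, c.w + c.n - c.nw));
}

// One possible way to combine them: pick the sub-predictor with the smallest
// accumulated error so far (a crude stand-in for history-based blending).
int32_t Predict(const CausalNeighbors& c, const int64_t error_sums[3]) {
  const int32_t candidates[3] = {
      PredictWest(c), PredictNorth(c), PredictGradient(c) };
  int best = 0;
  for (int i = 1; i < 3; ++i)
    if (error_sums[i] < error_sums[best]) best = i;
  return candidates[best];
}
```

In a real coder, the history (here `error_sums`) would be updated with each sub-predictor's absolute error after every pixel, so encoder and decoder stay in sync without side information.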

Another shortcoming of LASC was that large residuals required disproportionately more space (possibly even affecting other residuals) due to the simplistic packing. However, only a few bits of each residual are typically set, especially when the residuals are represented in sign-magnitude form. As with wavelet coders, we will therefore encode the bit planes separately.
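
A minimal sketch of the sign-magnitude mapping and the bit-plane slicing, assuming 16-bit sign-magnitude values and a packing order chosen purely for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Map a signed residual to sign-magnitude: magnitude in the low bits, sign in
// bit 15. Small residuals then set only a few low bits. (Residuals of 8-bit
// pixel data fit comfortably in this range.)
inline uint16_t ToSignMagnitude(int16_t r) {
  const uint16_t mag = static_cast<uint16_t>(r < 0 ? -r : r);
  return static_cast<uint16_t>(r < 0 ? (mag | 0x8000u) : mag);
}

// Slice an array of sign-magnitude residuals into bit planes: plane p collects
// bit p of every value, packed 8 values per byte. The number of planes and the
// packing order are assumptions for illustration.
std::vector<std::vector<uint8_t>> ExtractBitPlanes(const uint16_t* values,
                                                   size_t count, int planes) {
  std::vector<std::vector<uint8_t>> out(
      planes, std::vector<uint8_t>((count + 7) / 8, 0));
  for (size_t i = 0; i < count; ++i) {
    for (int p = 0; p < planes; ++p) {
      const uint8_t bit = (values[i] >> p) & 1u;
      out[p][i / 8] |= static_cast<uint8_t>(bit << (i % 8));
    }
  }
  return out;
}
```

The higher planes are extremely sparse for typical residuals, which is what makes the bit-array coder described next worthwhile.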

The most interesting part is the fixed-to-variable coding of entire vector registers of bits. For the image residuals above, it compresses to within 5-20% of the entropy at speeds comparable to or exceeding state-of-the-art integer coders (http://arxiv.org/abs/1209.2137). A single core compresses/decompresses at anywhere from 1000 MB/s (relatively dense bit arrays with 12% ones) to 4600 MB/s (sparse arrays with 0.05% ones). Of course this scales to multiple cores, as there is no shared dictionary to update.
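
The vectorized coder itself is not described in this post, but the fixed-to-variable idea can be illustrated with a scalar toy: each fixed-size block of bits (here 128 bits, the width of an SSE register) becomes a variable-length record, e.g. a count plus the positions of its 1-bits when the block is sparse. This is purely a conceptual stand-in, not the actual scheme.

```cpp
#include <cstdint>
#include <vector>

// Toy fixed-to-variable coder for one 128-bit block, for illustration only:
// sparse blocks are stored as a count plus 7-bit positions of the 1-bits;
// dense blocks are stored verbatim behind an escape tag.
void EncodeBlock(const uint8_t block[16], std::vector<uint8_t>* out) {
  std::vector<uint8_t> positions;
  for (int i = 0; i < 128; ++i)
    if (block[i / 8] & (1u << (i % 8)))
      positions.push_back(static_cast<uint8_t>(i));

  if (positions.size() < 16) {   // sparse: count + positions
    out->push_back(static_cast<uint8_t>(positions.size()));
    out->insert(out->end(), positions.begin(), positions.end());
  } else {                       // dense: escape code + raw 16 bytes
    out->push_back(0xFF);
    out->insert(out->end(), block, block + 16);
  }
}
```

The output size varies per block while the input size is fixed, hence "fixed-to-variable"; sparser input means shorter output, matching the throughput figures above.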

Although the instruction set improved with SSE4, writing vector code is still an exercise in improvisation. Finding 1-bits is accomplished by a hash function based on de Bruijn sequences (http://supertech.csail.mit.edu/papers/debruijn.pdf) and by using the FPU normalization hardware. Vectorized shifts are emulated with the PSHUFB universal shuffle instruction.
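
For reference, the well-known scalar formulation of the de Bruijn trick for locating the lowest 1-bit looks like this; the vectorized variant used in the codec is not shown.

```cpp
#include <cstdint>

// Index of the lowest set bit via a de Bruijn sequence: isolate the lowest
// 1-bit, multiply by a de Bruijn constant so the top 5 bits form a perfect
// hash of the bit position, then look the position up in a small table.
// Requires v != 0. (Classic bit-twiddling formulation.)
inline int IndexOfLowestSetBit(uint32_t v) {
  static const int kTable[32] = {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
  };
  const uint32_t isolated = v & (0u - v);   // keep only the lowest 1-bit
  return kTable[(isolated * 0x077CB531u) >> 27];
}
```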

I will begin writing up the details during the Lunar New Year weekend for later publication.
