Code Poetry
and Text Adventures

by catid posted (>30 days ago) 3:27pm Mon. Jul 1st 2013 PDT
I've been working on improving GCIF (my image format) to support an LZ77 mode so that it will perform better on computer art.  These images are ~90% LZ matches when encoded efficiently, so having a good LZ subsystem is important.  Right now I have a really fast 2D-LZ subsystem that preprocesses rectangular LZ matches, using a simple hash table for search.  Unsurprisingly, it turns out this is not ideal for most images, since a lot of artwork leaves bits outside the rectangles or the copy regions are too slender for the 3x3 window size.  So a lot of data falls through to the natural-image part of the codec, which handles computer art pretty well compared to PNG but is still really far from ideal.

After some thought I've decided to scrap the 2D-LZ subsystem for now, since keeping it around would add a lot of complexity.  The new LZ subsystem is going to use escape codes on the Y channel of RGBA images to indicate that an LZ match is happening, which is something I've toyed with before.  One problem with adding escape codes to the Y channel is that it interrupts zero-RLE.  But images are either mostly LZ or mostly literal, so optimizing the corner case where both are going on seems unimportant right now.  After reviewing a bunch of other codecs, I'm basing it on the WebP bitstream format as a starting point, since they seem to get a lot of great results from the way they apply LZ.

WebP's bitstream format is documented online, and it makes a lot of sense based on what I know about this sort of data:

Each LZ match to transmit is a length and a distance.  Useful lengths range from 2 to 1024.  You can go beyond that, but it doesn't seem too helpful: game textures are 1024x1024, so that's all we really care about.  The distance can vary from 1 pixel back to ~1024x1024 pixels back (the start of the image).  Setting some kind of limit on distance makes sense for simplicity, and 1M is reasonable.
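
To pin those ranges down, here is a tiny sketch of what a match record and its limits might look like.  The names and exact caps are just my working assumptions, not final bitstream decisions.

#include <cstdint>

// Working limits (assumptions for 1024x1024 game textures, not final):
static const uint32_t kMaxMatchLength = 1024;          // useful lengths are 2..1024
static const uint32_t kMaxMatchDistance = 1024 * 1024; // back up to ~1M pixels

// An LZ match to transmit: copy `length` pixels starting `distance` pixels back.
struct LZMatch {
    uint32_t length;
    uint32_t distance;
};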

Since the length field has the smaller range, it makes sense to add the escape codes for the length field (rather than the distance field) to the Y channel of the output image data.  Seeing a length escape code would simultaneously indicate that an LZ match is starting.

How to encode the length is an interesting question.  I don't want to add 1000 new symbols to the Y channel, because from what I know about prefix coding, having more symbols reduces the code efficiency.  To reduce the symbol count, I'm going to try what WebP is doing and add about 24 escape symbols for the length.  The lower escape codes are literal lengths like 2-8.  Higher ones start grouping length values together, like 9-10 or 11-12, and additional bits are emitted to carry the low precision bits of the length.  This makes sense because LZ lengths tend to fall off roughly exponentially, so it's a good model.
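
Here is a rough sketch of what that escape coding could look like.  The cutoffs and bucket sizes are placeholders in the spirit of WebP/DEFLATE-style length codes, not the final table.

#include <cassert>
#include <cstdint>

// Hypothetical length escape coding: lengths 2..8 each get their own symbol;
// beyond that, buckets double in size and the low bits of the length are sent
// raw after the symbol.  Cutoffs are illustrative only.
struct LengthCode {
    uint32_t symbol;     // escape symbol added to the Y channel alphabet
    uint32_t extra_bits; // count of raw bits emitted after the symbol
    uint32_t extra;      // value of those raw bits (low bits of the length)
};

static LengthCode EncodeLength(uint32_t len) {
    assert(len >= 2 && len <= 1024);

    if (len <= 8) {
        return { len - 2, 0, 0 }; // symbols 0..6: literal lengths 2..8
    }

    // Bucketed lengths: symbols 7,8 cover 9-10 and 11-12 with 1 extra bit,
    // symbols 9,10 cover 13..20 with 2 extra bits, and so on.
    uint32_t value = len - 9;
    uint32_t bits = 1;
    uint32_t bucket = 0;
    while (value >= (2u << bits)) { // two buckets per extra-bit size
        value -= (2u << bits);
        ++bits;
        bucket += 2;
    }
    uint32_t symbol = 7 + bucket + (value >> bits);
    uint32_t extra = value & ((1u << bits) - 1);
    return { symbol, bits, extra };
}

With these particular cutoffs the whole 2..1024 range lands in 23 symbols, right around that 24-symbol mark.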

After a length is encountered in the bitstream, a distance can be assumed to come next.  Distances can be coded similarly, though oftentimes in computer art the source of an LZ copy is a neighbor slightly above and to the left or right of the current line, so using short codes for those is best.
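
One illustrative way to do that is to remap nearby 2D offsets onto the shortest distance codes before falling back to coding the raw distance.  The specific offset table below is a placeholder of mine, not WebP's actual mapping or the final GCIF one.

#include <cstdint>

// Placeholder table of "close" copy sources: left, directly above, and a few
// pixels around the previous rows.
struct Offset { int8_t dx, dy; };

static const Offset kNearOffsets[] = {
    { -1, 0 },               // left neighbor
    { 0, -1 },               // directly above
    { -1, -1 }, { 1, -1 },   // above-left, above-right
    { -2, 0 },  { 0, -2 },
    { -2, -1 }, { 2, -1 },
    { -1, -2 }, { 1, -2 },
};

// Nearby 2D offsets get the first (cheapest) distance codes; everything else
// is coded as a raw distance biased past the table.
static uint32_t DistanceToCode(uint32_t distance, uint32_t image_width) {
    const uint32_t count = sizeof(kNearOffsets) / sizeof(kNearOffsets[0]);
    for (uint32_t i = 0; i < count; ++i) {
        int32_t d = -kNearOffsets[i].dy * (int32_t)image_width - kNearOffsets[i].dx;
        if (d > 0 && (uint32_t)d == distance) {
            return i; // short code for a close 2D neighbor
        }
    }
    return count + distance - 1; // general case, never collides with the table
}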

The last question is how to search for LZ matches.

For GCIF, compression is allowed to be a slow operation, so optimal parsing makes sense.  That means we want to consider all the options for matches instead of just "one that works."  I can see that for a real-time compression library like LZ4 you would want to do something less optimal.  In fact, LZ4 (even HC) is so incredibly fast because it uses a novel approach to searching for matches.

Based on what I have been able to get my hands on, there are only two match-finding approaches that fit optimal parsing for my project:

(1) Hash chains.
(2) Suffix tries.

Hash chains are ridiculously simple to implement, so I'm going to do that first.  They work great for data that is windowed, and my data is windowed up to 1 million pixels (max distance).
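
Here is a minimal hash-chain sketch over the pixel stream to show the mechanics.  The hash, table size, and 3-pixel minimum are placeholder choices, and max_chain is just the usual knob for bounding how far a chain walk can go.

#include <cstdint>
#include <vector>

// Illustrative hash-chain match finder (sizes and hash are placeholders).
class HashChainMatcher {
public:
    explicit HashChainMatcher(size_t pixel_count)
        : head_(kHashSize, kNone), prev_(pixel_count, kNone) {}

    // Add position `pos` to the chain for its hash bucket.  The caller
    // guarantees at least 3 pixels remain at `pos`.
    void Insert(const uint32_t *pixels, size_t pos) {
        uint32_t h = Hash(pixels + pos);
        prev_[pos] = head_[h];
        head_[h] = (int32_t)pos;
    }

    // Walk the chain for `pos` and return the longest earlier match found,
    // visiting at most `max_chain` candidates to bound the worst cases.
    size_t FindBestMatch(const uint32_t *pixels, size_t pos, size_t end,
                         size_t max_chain, size_t *best_dist) const {
        size_t best_len = 0;
        int32_t cand = head_[Hash(pixels + pos)];
        for (size_t walked = 0; cand != kNone && walked < max_chain;
             cand = prev_[cand], ++walked) {
            size_t len = 0;
            while (pos + len < end && pixels[cand + len] == pixels[pos + len]) {
                ++len;
            }
            if (len > best_len) {
                best_len = len;
                *best_dist = pos - (size_t)cand;
            }
        }
        return best_len;
    }

private:
    static constexpr int32_t kNone = -1;
    static constexpr uint32_t kHashSize = 1 << 16;

    // Hash the next 3 pixels (the shortest match worth coding here).
    static uint32_t Hash(const uint32_t *p) {
        uint32_t h = p[0] * 2654435761u ^ p[1] * 2246822519u ^ p[2] * 3266489917u;
        return h >> 16; // top 16 bits index the table
    }

    std::vector<int32_t> head_; // most recent position for each hash bucket
    std::vector<int32_t> prev_; // previous position with the same hash
};

The encoder would Insert() each position as it advances and call FindBestMatch() before deciding between a match and literals.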

Suffix tries are ridiculously hard to implement.  If I get really terrible compression speed and I cannot fix it with heuristics, I'll switch to this method.  This method has had guaranteed O(N) runtimes as of a couple of years ago.  The best implementation for generating this sort of data uses sorting: https://sites.google.com/site/yuta256/sais  The only complication appears to be that the matches are unsorted - you can get matches from behind AND ahead of the current position.  So I think the suffix array would need to be further radix-sorted into bins based on location in the data, and then you'd just look at the bins behind the current position when considering matches.  I'm still not sure how you'd fix degenerate cases any more easily with suffix tries than with hash chains.
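
To illustrate the position-filtering part, here is a toy sketch.  It builds the suffix array with a naive sort (a real version would use SA-IS from the link above) and skips candidates that come from ahead of the current position, which is the job the bins would do up front.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SuffixArrayMatcher {
    const uint32_t *data;
    size_t n;
    std::vector<size_t> sa;   // suffix start positions, in sorted suffix order
    std::vector<size_t> rank; // inverse: position -> index into sa

    // Naive O(n^2 log n) construction, for illustration only.
    void Build(const uint32_t *pixels, size_t count) {
        data = pixels;
        n = count;
        sa.resize(n);
        rank.resize(n);
        for (size_t i = 0; i < n; ++i) sa[i] = i;
        std::sort(sa.begin(), sa.end(), [&](size_t a, size_t b) {
            size_t len = std::min(n - a, n - b);
            for (size_t i = 0; i < len; ++i) {
                if (pixels[a + i] != pixels[b + i]) return pixels[a + i] < pixels[b + i];
            }
            return (n - a) < (n - b); // shorter suffix sorts first on a tie
        });
        for (size_t i = 0; i < n; ++i) rank[sa[i]] = i;
    }

    // Probe neighbors of `pos` in suffix order.  Only earlier positions are
    // valid LZ sources, so later ones are skipped here; binning by position
    // would remove them from consideration entirely.
    size_t FindBestMatch(size_t pos, size_t *best_dist) const {
        const size_t kProbe = 32; // how far to look on each side (heuristic)
        size_t best_len = 0;
        size_t r = rank[pos];
        for (int dir = -1; dir <= 1; dir += 2) {
            for (size_t step = 1; step <= kProbe; ++step) {
                ptrdiff_t idx = (ptrdiff_t)r + dir * (ptrdiff_t)step;
                if (idx < 0 || idx >= (ptrdiff_t)n) break;
                size_t cand = sa[(size_t)idx];
                if (cand >= pos) continue; // source must be behind us
                size_t len = 0;
                while (pos + len < n && data[cand + len] == data[pos + len]) ++len;
                if (len > best_len) { best_len = len; *best_dist = pos - cand; }
            }
        }
        return best_len;
    }
};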

The big problem with hash chains seems to be cases like AAA...AAAAA, where each "A" would be part of a really long chain.  There are lots of other degenerate cases, like ABABABAB and so on, that cause similar problems with compression speed.

Just finding the longest match is not the best option: all the candidate matches need to be weighed against their cost of representation in the bitstream.  I'm going to try to just brute-force it, and hopefully it won't be too slow.
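
Roughly, the comparison looks like this (with placeholder cost inputs standing in for real prefix-code table lookups):

#include <cstdint>

// Rough bit-cost comparison between a match and the literals it replaces.
// Placeholder inputs; real costs come from the entropy coder's tables.
struct MatchBitCost {
    uint32_t length_symbol;   // prefix code bits for the length escape symbol
    uint32_t length_extra;    // raw bits following the length symbol
    uint32_t distance_symbol; // prefix code bits for the distance code
    uint32_t distance_extra;  // raw bits following the distance code
};

static uint32_t MatchBits(const MatchBitCost &c) {
    return c.length_symbol + c.length_extra + c.distance_symbol + c.distance_extra;
}

// A match only wins if it costs fewer bits than coding the same run of pixels
// as literals through the normal channels.
static bool MatchBeatsLiterals(const MatchBitCost &c, uint32_t match_length,
                               uint32_t avg_literal_bits_per_pixel) {
    return MatchBits(c) < match_length * avg_literal_bits_per_pixel;
}
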
last edit by catid edited (>30 days ago) 3:28pm Mon. Jul 1st 2013 PDT