Code Poetry
and Text Adventures

by catid posted (>30 days ago) 1:22am Tue. Jul 9th 2013 PDT
For GCIF I tried a bunch of ways to encode the LZ77 data (length + distance) into the bitstream that did not help at all today.  The solution that seems to work the best is to use the same entropy encoder for the Y channel to encode the LZ match lengths also, indicating a match start, followed by a distance code generated from a separate encoder.

Things that did not work well:

- Using a fixed approximation for encoding cost is bad.  It's really important to know how many bits I am saving with each LZ match.  While the new LZ encoding matches everything it can regardless of how far away, it does a lousy job sometimes knowing when to do it at the moment.  The result is some files are fantastically better compressed and others are slightly worse.  To do a good job, I need to get approximate residuals from the RGBA data and base my decisions more closely on actual number of bits it takes to represent each pixel.

- Using all three channels (YUV) to encode LZ matches.  The RLE zero encoder built into the RGBA compressor is a great thing and I was worried that interrupting runs of zeroes with LZ matches would hurt compression ratio.  So I tried injecting fake zeroes and skipping on to the next channel (U or V) to drop the LZ length match.  Spreading it across all 3 channels made the compression worse.

- Using U or V channel in some cases.  This did not affect the compression by more than a fraction of a percent.  Not worth the complexity.

- Removing it from YUV completely and adding a skip value into the stream that would contain the offset to the next LZ match.  I was thinking maybe since a lot of file compressors do it this way it would work out.  Not so much.

I also experimented with adding more escape symbols to the entropy encoder.  These things failed to help:

- Adding a "same as last" symbol to stand in for symbols other than zeroes.  I expected this would help but no dice.

- Changing the 128 zRLE symbols to mean "repeat last symbol N times".  This was pretty bad.

So yeah today was pretty bad for compression improvements, but at least I don't need to wonder if I could have done it better.

Some more things to try that will probably fail too:

+ Add another bit of information to the LZ length encoding, which indicates that another LZ match immediately follows.  Use a separate set of encoders for the next length and distance.  I see this being good for a couple of reasons: The next length encoding can be much shorter.  And the statistics of follow-up matches may be very different from other types of matches so it's good to separate them.

+ Encode the low bits of the distance with another Huffman code.

+ Encode the low bits of the length with another Huffman code and use the Y-channel escape codes for representing some distance shortcuts.

+ Include one of the high bits of the distance and length in the leading codes.

So the new encoding would be:

Y-Channel Escapes:

(1) Same as distance -1
(2) Same as distance -2
(3) Same as distance -3
(4) Same as distance -4
(5) Literal distance 1
(6) Literal distance 2 (almost never used really)
(7) Literal distance 3
(8...12) Literal distance 8
(13) Local region encoding follows
(14) Global offset encoding follows

Local region encoding is about 128 nearby x,y points; each one is a Huffman code.

Global offset encodes 256 symbols for the high bits and sends the rest as literal bits.  The low 8 bits are Huffman encoded also.

This is followed by a length code:

Huffman code with 128 symbols for lengths 2...65 and extra bits for up to 256.  And another set of 128 symbols that mean the same thing, except that it also indicates an LZ match follows.

The following LZ match uses a distance encoding where (13) and (14) above are merged into the code, and the length encoding is done the same way but with a different table.