I've been able to achieve some additional compression by better tuning the LZ77 matches and revising the bitstream format.

Based on some test files I came up with a rough approximation of how residuals map to bits used in the final file to represent RGB data. I'm doing a rough pass through the RGBA data to come up with filters quickly, then generating residuals using L1 norm instead of entropy estimate. This gives me an upper bound on how many bits are saved by representing pixels with LZ matches instead of the natural compressor. If the compressor cannot at least stay under this upper bound to match a set of pixels, then it should not emit the match. This has measurable benefits!

The distance code has been split into two Huffman codes, one for LongDistances and one for ShortDistances in the local neighborhood. And I removed the extra Huffman codes. They didn't seem to really help much, and it will simplify the decoder.

The LongDistance code now has 256 symbols and can be followed by up to 16 bits. Often times these matches will be discovered even though they make compression worse. To counter-balance this, I now do a simulated write before accepting matches. I reject the ones that will definitely be too big and then recalculate the statistics to take advantage of the reduced dataset.

At this point for RGBA test files, the new LZ77 encoder does significantly better (>10% smaller files) fairly often, and occasionally does worse than the old 2D-LZ encoder. It will occasionally lose ~1KB on ~200+KB files in comparison with the 2D-LZ encoder. I think this is because those files have a substantial amount of copied pixels that are more efficiently represented with 2D encoding. However this is rarely the case, and I now believe it is not worth including 2D-LZ compression so that the decoder can run faster.

Based on some test files I came up with a rough approximation of how residuals map to bits used in the final file to represent RGB data. I'm doing a rough pass through the RGBA data to come up with filters quickly, then generating residuals using L1 norm instead of entropy estimate. This gives me an upper bound on how many bits are saved by representing pixels with LZ matches instead of the natural compressor. If the compressor cannot at least stay under this upper bound to match a set of pixels, then it should not emit the match. This has measurable benefits!

The distance code has been split into two Huffman codes, one for LongDistances and one for ShortDistances in the local neighborhood. And I removed the extra Huffman codes. They didn't seem to really help much, and it will simplify the decoder.

The LongDistance code now has 256 symbols and can be followed by up to 16 bits. Often times these matches will be discovered even though they make compression worse. To counter-balance this, I now do a simulated write before accepting matches. I reject the ones that will definitely be too big and then recalculate the statistics to take advantage of the reduced dataset.

At this point for RGBA test files, the new LZ77 encoder does significantly better (>10% smaller files) fairly often, and occasionally does worse than the old 2D-LZ encoder. It will occasionally lose ~1KB on ~200+KB files in comparison with the 2D-LZ encoder. I think this is because those files have a substantial amount of copied pixels that are more efficiently represented with 2D encoding. However this is rarely the case, and I now believe it is not worth including 2D-LZ compression so that the decoder can run faster.

* RGBA LZ Escape Bitstream Format

*

* Attached to the Y channel of the YUV encoder output are a number of

* extra "escape codes" that cause the decoder to go into an LZ decoding

* mode for the next few bits.

*

*

* Nice features of this format:

*

* (1) Include "same as last 1-4 distance" code and use it as often as possible.

*

* (2) Use most common 2D distance matches for escape codes in addition to common

* lengths. This results in LZ matches under 5 pixels that often take only 7-9 bits

* to represent overall.

*

* (3) Use only short literal lengths for escape codes. This is great because only

* long matches will make it through to the default "complex LZ" escape code, where

* I do not care too much about being careful about bit representation.

*

* (4) Only match up to 256 and if I have to emit a match length, emit a literal

* Huffman code for 2-256. Longer matches do not matter and this makes full use of

* the statistics.

*

* (5) To encode distances, an extended "local region" of pixels up to 16 to the

* left/right for the current and previous row, and -8 through +8 for the next 7

* rows are assigned symbols in a Huffman code, in addition to a bit count encoding

* that indicates how many bits the distance contains as well as some of the high bits.

*

*

* The extra (32) escape codes are:

*

* 256 = "Distance Same as 1st Last"

* 257 = "Distance Same as 2nd Last"

* 258 = "Distance Same as 3rd Last"

* 259 = "Distance Same as 4th Last"

* 260 = "Distance = 1"

* 261 = "Distance = 3"

* 262 = "Distance = 4"

* 263 = "Distance = 5"

* 264 = "Distance = 6"

* 265 = "Distance = (-2, -1)"

* 266 = "Distance = (-1, -1)"

* 267 = "Distance = (0, -1)"

* 268 = "Distance = (1, -1)"

* 269 = "Distance = (2, -1)"

* 270 = "Length = 2" + ShortDistance code follows

* 271 = "Length = 3" + ShortDistance code follows

* 272 = "Length = 4" + ShortDistance code follows

* 273 = "Length = 5" + ShortDistance code follows

* 274 = "Length = 6" + ShortDistance code follows

* 275 = "Length = 7" + ShortDistance code follows

* 276 = "Length = 8" + ShortDistance code follows

* 277 = "Length = 9" + ShortDistance code follows

* 278 = Length code + ShortDistance code follows

* 279 = "Length = 2" + LongDistance code follows

* 280 = "Length = 3" + LongDistance code follows

* 281 = "Length = 4" + LongDistance code follows

* 282 = "Length = 5" + LongDistance code follows

* 283 = "Length = 6" + LongDistance code follows

* 284 = "Length = 7" + LongDistance code follows

* 285 = "Length = 8" + LongDistance code follows

* 286 = "Length = 9" + LongDistance code follows

* 287 = Length code + LongDistance code follows

*

* For codes 256-269, the following bits are a Length code.

*

*

* Length Huffman code:

*

* 0 = "Length = 2"

* 1 = "Length = 3"

* ...

* 254 = "Length = 256"

*

*

* ShortDistance Huffman code:

*

* 0 = "Distance = 2"

* 1 = "Distance = 7"

* 2 = "Distance = 8"

* 3 = "Distance = 9"

* 4 = "Distance = 10"

* 5 = "Distance = 11"

* 6 = "Distance = 12"

* 7 = "Distance = 13"

* 8 = "Distance = 14"

* 9 = "Distance = 15"

* 10 = "Distance = 16"

*

* 11 = "Distance = (-16, -1)"

* 12 = "Distance = (-15, -1)"

* ...

* 24 = "Distance = (-3, -1)"

* 25 = "Distance = (3, -1)"

* ...

* 38 = "Distance = (16, -1)"

*

* 39 - 55 = "Distance (-8, -2) ... (8, -2)"

* 56 - 72 = "Distance (-8, -3) ... (8, -3)"

* 73 - 89 = "Distance (-8, -4) ... (8, -4)"

* 90 - 106 = "Distance (-8, -5) ... (8, -5)"

* 107 - 123 = "Distance (-8, -6) ... (8, -6)"

* 124 - 140 = "Distance (-8, -7) ... (8, -7)"

* 141 - 157 = "Distance (-8, -8) ... (8, -8)"

*

*

* LongDistance Huffman code:

*

* Let D = Distance - 17.

*

* Longer distances are encoded using 256 symbols to represent the high

* bits and the number of bits remaining at once. Extra bits follow

* the distance code.

*

* Converting from D to (Code, EB):

* EB = Bits(D >> 5) + 1

* C0 = ((1 << (EB - 1)) - 1) << 5

* Code = ((D - C0) >> EB) + ((EB - 1) << 4)

*

* Converting from Code to (D0, EB):

* EB = (Code >> 4) + 1

* C0 = ((1 << (EB - 1)) - 1) << 5

* D0 = ((Code - ((EB - 1) << 4)) << EB) + C0;

* D = D0 + ExtraBits(EB)

*

* Attached to the Y channel of the YUV encoder output are a number of

* extra "escape codes" that cause the decoder to go into an LZ decoding

* mode for the next few bits.

*

*

* Nice features of this format:

*

* (1) Include "same as last 1-4 distance" code and use it as often as possible.

*

* (2) Use most common 2D distance matches for escape codes in addition to common

* lengths. This results in LZ matches under 5 pixels that often take only 7-9 bits

* to represent overall.

*

* (3) Use only short literal lengths for escape codes. This is great because only

* long matches will make it through to the default "complex LZ" escape code, where

* I do not care too much about being careful about bit representation.

*

* (4) Only match up to 256 and if I have to emit a match length, emit a literal

* Huffman code for 2-256. Longer matches do not matter and this makes full use of

* the statistics.

*

* (5) To encode distances, an extended "local region" of pixels up to 16 to the

* left/right for the current and previous row, and -8 through +8 for the next 7

* rows are assigned symbols in a Huffman code, in addition to a bit count encoding

* that indicates how many bits the distance contains as well as some of the high bits.

*

*

* The extra (32) escape codes are:

*

* 256 = "Distance Same as 1st Last"

* 257 = "Distance Same as 2nd Last"

* 258 = "Distance Same as 3rd Last"

* 259 = "Distance Same as 4th Last"

* 260 = "Distance = 1"

* 261 = "Distance = 3"

* 262 = "Distance = 4"

* 263 = "Distance = 5"

* 264 = "Distance = 6"

* 265 = "Distance = (-2, -1)"

* 266 = "Distance = (-1, -1)"

* 267 = "Distance = (0, -1)"

* 268 = "Distance = (1, -1)"

* 269 = "Distance = (2, -1)"

* 270 = "Length = 2" + ShortDistance code follows

* 271 = "Length = 3" + ShortDistance code follows

* 272 = "Length = 4" + ShortDistance code follows

* 273 = "Length = 5" + ShortDistance code follows

* 274 = "Length = 6" + ShortDistance code follows

* 275 = "Length = 7" + ShortDistance code follows

* 276 = "Length = 8" + ShortDistance code follows

* 277 = "Length = 9" + ShortDistance code follows

* 278 = Length code + ShortDistance code follows

* 279 = "Length = 2" + LongDistance code follows

* 280 = "Length = 3" + LongDistance code follows

* 281 = "Length = 4" + LongDistance code follows

* 282 = "Length = 5" + LongDistance code follows

* 283 = "Length = 6" + LongDistance code follows

* 284 = "Length = 7" + LongDistance code follows

* 285 = "Length = 8" + LongDistance code follows

* 286 = "Length = 9" + LongDistance code follows

* 287 = Length code + LongDistance code follows

*

* For codes 256-269, the following bits are a Length code.

*

*

* Length Huffman code:

*

* 0 = "Length = 2"

* 1 = "Length = 3"

* ...

* 254 = "Length = 256"

*

*

* ShortDistance Huffman code:

*

* 0 = "Distance = 2"

* 1 = "Distance = 7"

* 2 = "Distance = 8"

* 3 = "Distance = 9"

* 4 = "Distance = 10"

* 5 = "Distance = 11"

* 6 = "Distance = 12"

* 7 = "Distance = 13"

* 8 = "Distance = 14"

* 9 = "Distance = 15"

* 10 = "Distance = 16"

*

* 11 = "Distance = (-16, -1)"

* 12 = "Distance = (-15, -1)"

* ...

* 24 = "Distance = (-3, -1)"

* 25 = "Distance = (3, -1)"

* ...

* 38 = "Distance = (16, -1)"

*

* 39 - 55 = "Distance (-8, -2) ... (8, -2)"

* 56 - 72 = "Distance (-8, -3) ... (8, -3)"

* 73 - 89 = "Distance (-8, -4) ... (8, -4)"

* 90 - 106 = "Distance (-8, -5) ... (8, -5)"

* 107 - 123 = "Distance (-8, -6) ... (8, -6)"

* 124 - 140 = "Distance (-8, -7) ... (8, -7)"

* 141 - 157 = "Distance (-8, -8) ... (8, -8)"

*

*

* LongDistance Huffman code:

*

* Let D = Distance - 17.

*

* Longer distances are encoded using 256 symbols to represent the high

* bits and the number of bits remaining at once. Extra bits follow

* the distance code.

*

* Converting from D to (Code, EB):

* EB = Bits(D >> 5) + 1

* C0 = ((1 << (EB - 1)) - 1) << 5

* Code = ((D - C0) >> EB) + ((EB - 1) << 4)

*

* Converting from Code to (D0, EB):

* EB = (Code >> 4) + 1

* C0 = ((1 << (EB - 1)) - 1) << 5

* D0 = ((Code - ((EB - 1) << 4)) << EB) + C0;

* D = D0 + ExtraBits(EB)

last edit by catid
edited (>30 days ago) 2:56am Sun. Jul 14th 2013 PDT