Today I think I finally understood how prefix codes are supposed to be used for compression. It's not really what I've been led to believe from what I've read online.

Typically people talk about prefix codes like Huffman codes as an "entropy encoding." A well-known issue with Huffman codes is that they have a minimum code length of 1 bit, which arithmetic/range coders solve by design. With Huffman codes you will often want to add some kind of run-length encoding to handle cases where this becomes an issue. But the real issue is something deeper IMHO, read on...

Prefix codes are an optimization. They're supposed to be used for getting good performance in a codec because their decoding can be accomplished with a single table lookup. This also means that having fewer symbols in the symbol set is better because it makes the table smaller and faster. But there are other reasons for a small symbol set...
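To make that "single table lookup" concrete, here is a toy sketch of my own (not any particular codec's code): the decoder peeks at the next MAX_BITS bits of the stream and uses them as an index into a table whose entries store the decoded symbol and its code length.

```python
# Toy single-table prefix decoder. The table is indexed by the next
# MAX_BITS bits; every bit pattern that starts with a valid code maps
# to that code's (symbol, length).

MAX_BITS = 4

# Hypothetical prefix code: 'a'->0, 'b'->10, 'c'->110, 'd'->111
CODES = {'a': ('0', 1), 'b': ('10', 2), 'c': ('110', 3), 'd': ('111', 3)}

# Build the lookup table: pad each code out to MAX_BITS with every
# possible suffix, so any window beginning with the code resolves to it.
table = [None] * (1 << MAX_BITS)
for sym, (bits, length) in CODES.items():
    prefix = int(bits, 2) << (MAX_BITS - length)
    for fill in range(1 << (MAX_BITS - length)):
        table[prefix | fill] = (sym, length)

def decode(bitstring, count):
    """Decode `count` symbols from a string of '0'/'1' characters."""
    out, pos = [], 0
    padded = bitstring + '0' * MAX_BITS  # pad so the final peek is safe
    for _ in range(count):
        window = int(padded[pos:pos + MAX_BITS], 2)
        sym, length = table[window]       # one lookup per symbol
        out.append(sym)
        pos += length                     # consume only the code's bits
    return ''.join(out)

# "010110" is 'a'(0) 'b'(10) 'c'(110) -> "abc"
```

Note the table has 2^MAX_BITS entries, so every extra bit of maximum code length doubles the table size; that is exactly why a smaller symbol set (and shorter maximum code) makes the decoder faster.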

Everything from here on out is just my intuition, and may be completely false. But I think it's right:

The other problem with prefix codes AND arithmetic/range codes is that they also have a fixed maximum length in practice. Sort of. Except that should *never* be a problem. The maximum bits to represent some input literal data should be just a little more than the original literal, no matter how unlikely that data was. This indicates something is missing from my understanding: Shouldn't this "entropy code" be handling all these tail cases?

Focusing on prefix codes, intuitively the whole idea of using prefix encoding to represent data given the symbol probability is not how compression should work. There needs to be another layer of modeling in between the input data modeling and the symbol generation for Huffman encoding.

For instance, my image compressor generates data that is mostly zeroes as a result of filtering (modeling) the input image pixels. There are a lot of +1, -1 symbols, fewer +2, -2 symbols, and so on.

I've been "fixing" the minimum-length-of-1-bit problem by introducing zRLE codes into the symbol set. This expands the zeroes into 128 additional run-length symbols, so that each one takes more than 1 bit to represent, which reduces the effect of quantization error. This helps tremendously with compression efficiency.
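A minimal sketch of the zRLE idea, using a hypothetical symbol numbering (the real bitstream's numbering will differ): runs of zeros fold into one of 128 run-length symbols, so no individual zero costs a whole bit.

```python
# Sketch: map filtered residuals to a symbol stream with zero-run-length
# symbols. Symbol ids 0..255 are literal byte values; 256+k (hypothetical
# numbering) means "a run of k+1 zeros", with up to 128 run symbols.

MAX_RUN = 128

def to_symbols(residuals):
    """Convert residuals (-128..127) into literal + zRLE symbols."""
    out, i = [], 0
    while i < len(residuals):
        if residuals[i] == 0:
            # Count the zero run, capped at MAX_RUN
            run = 1
            while (i + run < len(residuals)
                   and residuals[i + run] == 0 and run < MAX_RUN):
                run += 1
            out.append(256 + run - 1)   # one symbol for the whole run
            i += run
        else:
            out.append(residuals[i] & 0xFF)  # literal residual as 0..255
            i += 1
    return out

# [0,0,0,5,0] -> [258, 5, 256]: a run of 3 zeros, literal 5, a run of 1
```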

But right now I'm neglecting the other end of the spectrum: How to handle codes like +56 or -72 that rarely happen? They're so rare that their encoding is terribly inefficient. But since they're rare it doesn't make a lot of difference, right? Especially since practical Huffman encoders have an upper limit on the code size of about 16 bits, this is not a huge problem even if it happens a lot more than it should. But it also uses valuable symbol slots for something that may never happen at all, slowing down the decoder and moving away from the "sweet spot" near ideal binary codes where prefix coders work best.

I'm thinking that in general there needs to be another modeling step that takes the histogram of the original -128 to +127 data after filtering and converts it into a set of symbols for prefix encoding. Combining consecutive 0,0,0,... data is being done now, but really I should also be combining 0,1,1 and 1,1,-1, etc., as it makes sense. Instead of having a symbol code for "+56", I should combine these data together so that "+54 through +57" is one symbol and some extra bits are emitted for the remainder.

Let's back up. Right now the data flows through my encoder like this:

RGBA + Color Decorrelation Filtering => YUVA + Spatial Filtering => Symbols + Context Chaos Modeling and Static Huffman Encoding => File Bits

I think there needs to be another modeling step, because the symbols in the above flow are exponentially distributed, while the Huffman encoder works best with uniformly distributed symbols.

RGBA + Color Decorrelation Filtering => YUVA + Spatial Filtering => Symbols + Symbol Model => Smoothed Symbols + Context Chaos Modeling and Static Huffman Encoding => File Bits

In just a few words: The goal is to make each Huffman symbol equiprobable, so that the Huffman code is as close to an ideal code as possible.

The Huffman encoder can tolerate some error in the input when things aren't exactly ideal, but I should make an effort to get it close.
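To put a number on how much that quantization error costs, here's a small self-contained check (my own illustration, not the codec's code): build a Huffman code with the textbook merging procedure and compare its average code length against the Shannon entropy. For an equiprobable power-of-two alphabet the two match exactly; for a heavily skewed distribution there's a real gap.

```python
# Compare Huffman average code length vs. Shannon entropy to show the
# integer-bit quantization cost on skewed distributions.

import heapq
import math

def huffman_lengths(probs):
    """Code length per symbol via standard Huffman pair-merging."""
    # Heap entries: (probability, tiebreaker, member symbol indices)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1   # every member sinks one level deeper
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

def avg_len(probs):
    return sum(p * l for p, l in zip(probs, huffman_lengths(probs)))

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25] * 4               # avg length 2.0 == entropy 2.0
skewed = [0.9, 0.05, 0.03, 0.02]   # entropy ~0.62 bits, Huffman 1.15 bits
```

With the skewed distribution the Huffman code spends almost twice the entropy, because the 0.9-probability symbol is forced to cost a full bit. That is exactly the gap the extra symbol-modeling step is meant to close.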

The symbol model has two modes:

(1) To make the too-common symbols less common, I can combine runs of symbols together into one symbol: {0,0} => {3} and {1,1} => {4}.

(2) To make the too-rare symbols more common, I can combine several symbols into one and emit extra bits: {65-68} => {16}+(2 bits).
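Mode (2) can be sketched like this (the symbol numbers and bucket boundaries here are hypothetical, purely for illustration): a bucket of rare values shares one symbol, and the raw extra bits select which value inside the bucket.

```python
# Sketch of mode (2): rare residuals 54..57 share one symbol plus
# 2 extra raw bits for the remainder. Symbol id 200 is made up here.

BUCKET_BASE = 54      # first value covered by the bucket (assumed)
BUCKET_BITS = 2       # 2 extra bits -> bucket spans 4 values
BUCKET_SYMBOL = 200   # hypothetical symbol id for this bucket

def encode_value(v):
    """Return (symbol, extra_bits, extra_bit_count) for a residual."""
    if BUCKET_BASE <= v < BUCKET_BASE + (1 << BUCKET_BITS):
        return (BUCKET_SYMBOL, v - BUCKET_BASE, BUCKET_BITS)
    return (v, 0, 0)   # common values keep their own symbol

def decode_value(symbol, extra_bits):
    """Invert encode_value given the symbol and its extra bits."""
    if symbol == BUCKET_SYMBOL:
        return BUCKET_BASE + extra_bits
    return symbol

# encode_value(56) -> (200, 2, 2): bucket symbol + extra bits "10"
```

The extra bits are effectively a flat code inside the bucket, which is fine precisely because the bucketed values are assumed to be roughly equiprobable with each other.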

I just need to collect statistics from the entire test set of video game graphics we are using, determine the encoding method that will make the symbols closest to equiprobable on average, and hard-code that as the new bitstream file format.

I want to have about 256 symbols after making these changes, which feels right:

(1) The Huffman decoding table is not too big, so the table lookup is fast.

(2) The table is not too small, so that there are lots of options for encoding efficiently.

(3) The table is not too big, so that the histogram variance is lower and symbols are not too common nor too rare.

I'm going to wait until after the new LZ encoding is working, since that will affect the statistics.

It may also be that the symbol distribution varies enough to have several types of models that can be selected in the encoder.

last edit by catid
edited 5:47pm Mon. Jul 1st 2013 PDT