I've invented a constant-time algorithm that brings wirehair overhead down from 1.7 to 0.03. So for very little additional processing time, the overhead is significantly improved: Instead of sometimes seeing 4+ additional blocks required during simulation, it's now common to see 0 and sometimes 1. I haven't seen this done in any of the whitepapers I've read on the topic, and it gives wirehair a shot at replacing Reed-Solomon codes in some places.

The current version of wirehair is based on a random binary check matrix. This gives it the same properties as a random binary matrix, which is invertible about 30% of the time -- given randomly selected check blocks, it can recover the original message 30% of the time with zero overhead. So, on average, about N + 1.7 blocks are needed to recover the original message.

Compare this to small random GF(256) matrices, which are invertible ~97% of the time according to simulations. This indicates that if the check matrix were made up of these GF(256) values instead of {0,1}, it would have much lower overhead: It would take on average about N + 0.03 blocks to recover the original message. Keep this under your hat; we'll revisit it after some background on GE (Gaussian elimination).
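As a sanity check on those percentages, the standard counting argument reproduces them. Here's a quick Python sketch (illustrative only, not code from wirehair):

```python
# Probability that a random N x N matrix over GF(q) is invertible:
# the k-th row must avoid the span of the previous k-1 rows, giving
# prod_{k=1..N} (1 - q^(k-1)/q^N) = prod_{k=1..N} (1 - q^-k).
def invertible_probability(n, q):
    p = 1.0
    for k in range(1, n + 1):
        p *= 1.0 - q ** -k
    return p

print(invertible_probability(64, 2))    # ~0.289, the ~30% figure for GF(2)
print(invertible_probability(64, 256))  # ~0.996 for a uniformly random GF(256) matrix
```

The exact product gives about 28.9% for GF(2) and about 99.6% for a uniformly random GF(256) matrix -- the same ballpark as the ~97% from simulations.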

The matrix inversion is done with Gaussian elimination, which, as you may recall, sweeps the matrix from left to right, looking in each column for a not-yet-used row with a non-zero entry there, called a pivot. Whenever a row is selected as a pivot, it cannot be selected again. Since each entry is a 1 or 0 with equal probability, any single candidate row has a 50% chance of being zero in that column. Because there are so many candidate rows to choose from when the process starts, Gaussian elimination tends to fail near the end of the process, when few candidates remain. So, for a well-structured check matrix, failures happen on the far right of the matrix. Unsurprisingly, the probability of a failure at a given column is about 0.5^(count of rows remaining).
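To see the clustering concretely, here's a small Python simulation of plain GF(2) Gaussian elimination on uniformly random square matrices, recording the column where the first pivot search fails (a sketch of the idea, not wirehair's actual solver):

```python
import random

def first_failure_column(n, rng):
    """Run GF(2) Gaussian elimination on a random n x n binary matrix.
    Rows are stored as integer bitmasks. Returns the column index where
    no pivot row can be found, or None if the matrix is invertible."""
    rows = [rng.getrandbits(n) for _ in range(n)]
    used = [False] * n
    for col in range(n):
        bit = 1 << col
        pivot = next((i for i in range(n)
                      if not used[i] and rows[i] & bit), None)
        if pivot is None:
            return col                      # pivot search failed here
        used[pivot] = True
        for i in range(n):                  # clear this column from the rest
            if i != pivot and not used[i] and rows[i] & bit:
                rows[i] ^= rows[pivot]
    return None

rng = random.Random(1)
n, trials = 32, 2000
fails = [c for c in (first_failure_column(n, rng) for _ in range(trials))
         if c is not None]
print(1 - len(fails) / trials)              # success rate, near ~0.29
print(sum(c >= n - 8 for c in fails) / len(fails))  # failures cluster at the end
```

With these parameters the success rate comes out near the expected ~29%, and essentially all failures land in the last handful of columns.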

Here's a visualization of GE with GF(256) rows attached: http://www.youtube.com/watch?v=GIhcgmwXGmE

If I tack on a few heavy GF(256) rows to the matrix, I need to know how many to add. We want an overall failure rate of 3%, so the number of rows needs to exceed ln(0.03)/ln(0.5) = 5.1, which rounds up to 6. This guarantees that the binary matrix failure rate doesn't dominate the overall failure rate.
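The row count calculation is one line; here it is as a hypothetical helper (each heavy row roughly halves the residual failure rate of the binary part):

```python
import math

def heavy_rows_needed(target_failure_rate):
    # Each extra heavy row roughly halves the residual failure rate,
    # so we need ln(target)/ln(0.5) rows, rounded up.
    return math.ceil(math.log(target_failure_rate) / math.log(0.5))

print(heavy_rows_needed(0.03))  # 6: ln(0.03)/ln(0.5) = 5.1, rounded up
```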

The next thing to figure out is how to generate each row. Initially I figured some kind of regular pattern would be the best idea, but really the raw output of a random number generator works fine. I tied it to the other seeds so that bad matrices will get tuned out during simulation.

Now, the secret sauce. I decided to add heavy GF(256) columns to the matrix only where they matter: Near the end! The advantages are clear. Instead of having a lot of non-zero columns in the heavy rows, the vast majority of the columns are zero. In fact, the GF(256) submatrix is 6x18 = 108 bytes! Until the last few columns, these few additional rows are ignored. Then when their recovery properties are needed, these heavy rows are rolled out of the armory to fill in holes in the binary check matrix.
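The heavy rows require byte-wise GF(256) arithmetic. As an illustration, here's the shift-and-reduce multiply using the common reduction polynomial 0x11D (x^8+x^4+x^3+x^2+1); wirehair's actual field representation may differ:

```python
def gf256_mul(a, b, poly=0x11D):
    """Carry-less 'Russian peasant' multiply of two bytes in GF(2^8),
    reducing by poly whenever the product overflows 8 bits."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a      # add (XOR) the current shift of a
        b >>= 1
        a <<= 1
        if a & 0x100:        # overflowed 8 bits: reduce modulo poly
            a ^= poly
    return result

print(hex(gf256_mul(0x80, 0x02)))  # 0x1d: 0x100 reduced by 0x11d
```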

I call it constant time because while the check matrix can grow to be 800x800, the GF(256) submatrix stays at 6x18 and never gets any bigger. In fact, the throughput for a 96MB file with 1500 byte blocks is < 0.1 MB/s different, but the number of additional blocks required is 0.03 instead of 1.7.

On my laptop it takes about 150-200 microseconds additional processing time, which is significant at smaller sizes but fades into the noise at larger sizes. For instance, for N=16, without these rows it normally takes about 50 microseconds. So the additional rows greatly reduce throughput for these smaller sizes. However for N=2048 it takes 9 milliseconds so the effect is much smaller.

I'm still working on fixing bugs in the new algorithm, called CAT_USE_HEAVY in the code, but I've been able to simulate failure rates end to end with the full encoder + decoder, and everything seems to be mathematically sound.

last edit by catid
edited 3:56am Sun. Mar 4th 2012 PST