My hobby project for the past two weeks has been implementing Cauchy Reed-Solomon (CRS) codes with low overhead, so that they can be used for real-time communication over wireless Internet. It's finally finished and I have some results to share! =)

The code is somewhat hidden away in a branch of Wirehair right now: https://github.com/catid/wirehair/tree/crs/crs It's not quite finished yet. Read on..

For some background, my Wirehair project is the current speed leader for some types of video/audio streaming applications. For 4 MB/s data rates, there is really no better open-source choice at this time. However, for messaging applications Wirehair seems too heavy, and I wanted to improve on the state of the art for short codes as well. My goal is to switch between CRS and Wirehair based on the data rate.

Anyway: CRS codes were popularized by the Jerasure library, which also holds all of the speed records for m > 2. Sort of. It turns out that CRS codes get slower as you turn up m (the redundancy), and the encoder gets slow really fast.

How much redundancy is needed? For a target loss rate of 0.1% and a line loss rate of 20%, I've simulated that k=12 packets need about m=12 additional redundant packets. And for k=180 packets, about m=72, bringing the total to 252 symbols, a little under the 256-symbol soft upper limit for CRS codes.
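My numbers above come from my own simulation, but you can get a rough estimate of the required m with a binomial tail, assuming i.i.d. packet loss. This is just a sketch with names of my own choosing; real links are burstier, so an i.i.d. model will tend to call for somewhat less redundancy than the figures above.

```python
from math import comb

def decode_failure_prob(k, m, loss):
    """P(decode failure) for an ideal (k+m, k) erasure code under
    i.i.d. packet loss: failure when more than m of k+m packets drop."""
    n = k + m
    return sum(comb(n, d) * loss**d * (1 - loss)**(n - d)
               for d in range(m + 1, n + 1))

def redundancy_needed(k, loss, target):
    """Smallest m whose decode failure probability is at or below target."""
    m = 0
    while decode_failure_prob(k, m, loss) > target:
        m += 1
    return m

# 0.1% target failure rate at 20% line loss for k = 12 data packets
m = redundancy_needed(12, 0.20, 0.001)
```

For k = 12 at 20% i.i.d. loss this lands in the same ballpark as the m = 12 figure above, though burst-loss models demand more.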

For k=180, m=72 even my fast codec requires about 10 milliseconds on a laptop to generate the redundant packets, whereas Wirehair, an asymptotically better codec I wrote a few years ago, takes just 1.5 milliseconds. The general advantage of CRS codes over Wirehair is that the decoder is typically much faster. So long as the encoder isn't too slow to be usable, and the amount of data isn't too high, CRS codes can be preferable.

For k=12, m=12 the encoder takes 424 microseconds and the decoder takes 169 microseconds. Wirehair takes 210 usec each to encode and to decode. But those are best-case Wirehair figures against worst-case decoding time for the CRS code, so CRS codes decode much faster in general.

For a target loss rate of 0.1% and line loss of 5%, I've simulated that k=29 packets need about m=7 additional redundant packets. It takes 539 usec to encode with CRS and 381 usec to decode in the worst case. Wirehair takes 400 usec to encode and 334 usec to decode. So, considering the other available options, the limit to the utility of CRS codes is about 32 packets per code group.

What these initial results indicate to me is that, since the encoder is so slow, CRS codes for packet error correction are best used for code groups of fewer than 32 packets. This means that switching to a w = 5 code (symbols in GF(2^5), which caps a code group near 32 packets anyway) would cover all of the interesting cases while also yielding smaller and better-tuned lookup tables. I can't think of any serious disadvantages of picking a smaller w for a CRS code.

Before I run benchmarks against Jerasure, I'm going to switch to an optimized w=5 or w=6 code to squeeze more performance out of the codec where it's practical.

This is starting to get into the technical aspects of the CRS code, but w = 6 may actually be faster than w = 5: I can throw away half of all the possible matrix configurations and keep the best half, the ones with fewer one bits set in the bitmatrix. That would be faster for larger packet counts and maybe a little slower for smaller packet counts. I'm expecting the encoder to be almost twice as fast as a result.
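To make the bitmatrix idea concrete, here is a sketch of how a GF(2^5) Cauchy matrix entry expands into a 5x5 bit matrix whose one-bit count determines the XOR work in the encoder. The field polynomial and the evaluation points are my own illustrative choices, not necessarily what my codec uses.

```python
W = 5
POLY = 0b100101  # x^5 + x^2 + 1, a primitive polynomial for GF(2^5)

def gf_mul(a, b):
    """Multiply in GF(2^5): shift-and-XOR with polynomial reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << W):
            a ^= POLY
    return r

def gf_inv(a):
    """Inverse of nonzero a: a^(2^W - 2), brute force (the field is tiny)."""
    r = 1
    for _ in range((1 << W) - 2):
        r = gf_mul(r, a)
    return r

def bit_weight(c):
    """One-bit count of the W x W binary matrix for 'multiply by c':
    column j of that matrix is the field element c * x^j."""
    return sum(bin(gf_mul(c, 1 << j)).count("1") for j in range(W))

def cauchy_matrix(m, k):
    """m x k Cauchy matrix with entries 1 / (x_i + y_j); the disjoint
    point sets here are a simple illustrative choice (m + k <= 32)."""
    xs = list(range(m))
    ys = list(range(m, m + k))
    return [[gf_inv(xs[i] ^ ys[j]) for j in range(k)] for i in range(m)]

M = cauchy_matrix(3, 4)
total_ones = sum(bit_weight(c) for row in M for c in row)
```

The encoder's XOR count tracks total_ones, which is why keeping only the lighter half of candidate matrices (easier with the headroom of w = 6) should pay off.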

If you're still reading this, then I'll throw some other interesting packet error correction code facts at you. Google's QUIC uses error correction that is m = 1: it just XORs all the data packets together into one check packet. CRS codes with m = 1 actually produce the exact same optimal algorithm, and I've specialized my codec to take advantage of it. This is kind of nice, since I don't need a third codec for m = 1.
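For illustration, the m = 1 scheme reduces to plain XOR parity. This sketch shows encoding and single-loss recovery (the function names are mine, not QUIC's or my codec's):

```python
def xor_parity(packets):
    """m = 1 'check packet': the XOR of all equal-length data packets."""
    check = bytearray(len(packets[0]))
    for p in packets:
        for i, byte in enumerate(p):
            check[i] ^= byte
    return bytes(check)

def recover_one(received, check):
    """Recover the single missing packet: XOR the check with the survivors."""
    missing = bytearray(check)
    for p in received:
        for i, byte in enumerate(p):
            missing[i] ^= byte
    return bytes(missing)
```

Any one lost packet can be rebuilt this way, but two or more losses in a group are unrecoverable, which is why m = 1 only suits low loss rates.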

There is also a good chance that I can make the decoder about 10% *faster* by windowing the back-substitution step of the Gaussian elimination solver I implemented. As a toy example: as the solver walks up the bitmatrix and sees "11", it can cache the XOR of the two solved rows and then XOR once instead of twice. Wirehair does this with variable window sizes on both the GE and BS steps, and it gave a 10% speed boost for each of those functions. However, the decoder is already blindingly fast, so I may just keep the code simple.
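Here is a toy sketch of that window-2 back-substitution over a GF(2) bitmatrix: rows are stored as integers (bit j marks column j), row values are byte strings, and each pair of solved columns caches one XOR combination so a row matching "11" pays a single XOR. This is illustrative only; Wirehair's variable-size windowing is more involved.

```python
def back_substitute_w2(U, b):
    """Solve U x = b over GF(2) by back-substitution with a window of 2.
    U is upper-triangular with a unit diagonal; U[i] is an int whose
    bit j marks a 1 in column j. b holds equal-length byte strings."""
    n = len(b)
    x = [bytearray(v) for v in b]

    def xor_into(dst, src):
        for j in range(len(dst)):
            dst[j] ^= src[j]

    col = n - 1
    while col > 0:
        lo, hi = col - 1, col
        if (U[lo] >> hi) & 1:       # finish solving row lo first
            xor_into(x[lo], x[hi])
        combo = bytes(p ^ q for p, q in zip(x[lo], x[hi]))  # cached pair
        for i in range(lo):
            bits = (U[i] >> lo) & 0b11
            if bits == 0b11:
                xor_into(x[i], combo)   # "11": one XOR instead of two
            elif bits == 0b01:
                xor_into(x[i], x[lo])
            elif bits == 0b10:
                xor_into(x[i], x[hi])
        col -= 2
    return [bytes(v) for v in x]
```

With a window of 2 the cached pair is used whenever both columns are set, which is where the XOR savings come from; wider windows cache more combinations at the cost of more setup.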

More to come! :}

Last edited by catid, 1:25am Mon. Feb 3rd 2014 PST