Given a constraint on a matrix size (MxK), what's the best Galois field MDS (Reed Solomon) matrix (of any polynomial or bit count) for multiplication in the context of erasure codes?

Some background:

Galois field (GF) multiplication boils down to XORs.

GF elements are numbers between 0 and 2^B - 1, where B denotes the field's bit count.

GF elements can be expanded out into BxB binary matrices. This is how my Longhair codec works: it uses B = 8 and tries to pick matrices that minimize the number of ones in these binary expansions.
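For concreteness, here's a small sketch of that expansion: column j of the bit matrix holds the bits of a*x^j, so multiplying the matrix by the bit vector of b yields the bits of a*b. The polynomial 0x11D is an assumption for illustration (a common GF(256) choice), not necessarily the one Longhair uses:

```python
# Sketch: expand a GF(2^B) element into its B x B binary multiplication
# matrix and count the ones. POLY = 0x11D is an assumed polynomial for
# illustration; Longhair's choice may differ.
B = 8
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, high bit included

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo POLY.
    r = 0
    for i in range(B):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * B - 2, B - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - B)
    return r

def bit_matrix(a):
    # Column j holds the bits of a * x^j, so M @ bits(b) = bits(a * b).
    cols = [gf_mul(a, 1 << j) for j in range(B)]
    return [[(cols[j] >> i) & 1 for j in range(B)] for i in range(B)]

def ones(a):
    return sum(sum(row) for row in bit_matrix(a))
```

Searching for entries with a low `ones(a)` count is the metric described above; the element 1 expands to the identity and scores the minimum possible B ones.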

When I talk about multiplication, I mean it in the context of erasure codes: the input packets (assume they are all padded out to the same size) are XORed together, as directed by the matrix, to produce a set of check symbols, each of the same size.

If the matrix is:

[

01 02

03 04

]


Then you'll need to know which GF it corresponds to. How many bits? GF(256) would be an 8-bit field, but there are also 16 choices of (primitive) GF polynomial that define how the numbers 0..255 are mapped to 8x8 binary matrices.

Once the GF is selected, then the matrix can be expanded out to a binary one. The element 1 always corresponds to an identity matrix:

[

1 0 0 0

0 1 0 0

0 0 1 0

0 0 0 1

]


When split up this way, the input data and the output data are also split into 4 pieces each. If the input/output is 32 bytes long, then the first 8 bytes of the input data are addressed by the first column of the bit matrix. The second 8 bytes at offset 8 bytes of the input data are addressed by the second column of the bit matrix.

Similarly, the output data is 32 bytes and the first row corresponds to the first 8 bytes of the output data. The 4th row corresponds to the last 8 bytes of the output data.

So if the first row is [1 1 0 0], then the first 8 bytes of output data will be the XOR of the first 8 bytes of the input data with the second 8 bytes of the input data.
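The row rule can be sketched directly; a minimal illustration with hypothetical helper names (not Longhair's API):

```python
# Sketch: apply a bit matrix row-by-row to a buffer split into pieces.
# Each output piece is the XOR of the input pieces selected by that row.
def apply_bitmatrix(rows, data, piece_len):
    n = len(rows[0])
    pieces = [data[i * piece_len:(i + 1) * piece_len] for i in range(n)]
    out = []
    for row in rows:
        acc = bytearray(piece_len)
        for bit, piece in zip(row, pieces):
            if bit:
                acc = bytearray(a ^ b for a, b in zip(acc, piece))
        out.append(bytes(acc))
    return out

data = bytes(range(32))
out = apply_bitmatrix([[1, 1, 0, 0]], data, 8)
# out[0] is the XOR of the first two 8-byte pieces of data
```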

And each column of the GF matrix corresponds to a different input buffer. A matrix with 5 columns will have 5 input buffers, which may be different packets it is encoding, padded out to be the same length.

This means that calculations used to produce the output cannot be reused between GF columns, but can be reused within the same GF column. The bitmatrix rows there are all XOR combinations of pieces of the same input buffer, so if a bit pattern repeats between rows, the temporary result can be computed once and reused.

My Longhair codec precalculates when there are more than 5 rows of output: it computes some intermediate sums once and reuses them instead of recalculating them for each row. This cuts the number of XORs down by roughly 3x. With a carefully selected matrix, it appears the overall number of XORs can be much lower still if the matrix is chosen to optimize for this metric.
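A toy example of that reuse, with small byte buffers standing in for packets (this is only an illustration of the idea, not Longhair's actual planner):

```python
import os

# Two bitmatrix rows that share the pair x0 ^ x1:
#   y0 = x0 ^ x1          (row [1 1 0])
#   y1 = x0 ^ x1 ^ x2     (row [1 1 1])
x = [os.urandom(8) for _ in range(3)]
bxor = lambda a, b: bytes(p ^ q for p, q in zip(a, b))

# Naive evaluation: 3 buffer XORs.
y0 = bxor(x[0], x[1])
y1 = bxor(bxor(x[0], x[1]), x[2])

# With reuse: compute the shared sum once -- 2 buffer XORs total.
t = bxor(x[0], x[1])
assert y0 == t and y1 == bxor(t, x[2])
```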

It's also true that the expanded bitmatrix for each GF element of the GF matrix has two valid orientations: taking the transpose of each expanded bitmatrix still yields an MDS matrix. This is borne out by Longhair and Jerasure, which use opposite orientations for their bitmatrices and both produce MDS matrices.

Comparison with MDS-AES types

These erasure-code-specialized multiplication matrices can outperform the MDS-AES types because the MDS-AES types are required to be involutory.

The design of MDS-AES types does not take the bitmatrix XOR approach into account. I've looked at the recent papers by Sajadieh and Gupta ("On constructions of involutory MDS matrices"); they also show great results based on Cauchy matrices.

Here's one 3x3 result they present:

[

2 7 4

3 6 4

3 7 5

]

T1 = 2x1, T2 = 7x2, T3 = 4x3,

y1 = T1 + T2 + T3

y2 = y1 + x1 + x2

y3 = y1 + x1 + x3


This requires 6 whole input buffer XORs on top of three expensive GF multiplications.
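The schedule can be checked against the direct matrix product. This sketch assumes the AES polynomial 0x11B for concreteness; the schedule's algebra (e.g. 2 ^ 1 = 3 and 7 ^ 1 = 6 on the constants) holds for any GF(256) polynomial:

```python
POLY, B = 0x11B, 8  # assumed polynomial; any GF(256) choice works here

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo POLY.
    r = 0
    for i in range(B):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * B - 2, B - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - B)
    return r

def direct(x1, x2, x3):
    row = lambda a, b, c: gf_mul(a, x1) ^ gf_mul(b, x2) ^ gf_mul(c, x3)
    return row(2, 7, 4), row(3, 6, 4), row(3, 7, 5)

def scheduled(x1, x2, x3):
    y1 = gf_mul(2, x1) ^ gf_mul(7, x2) ^ gf_mul(4, x3)  # T1 + T2 + T3
    y2 = y1 ^ x1 ^ x2
    y3 = y1 ^ x1 ^ x3
    return y1, y2, y3

assert all(direct(a, b, c) == scheduled(a, b, c)
           for a in (1, 0x53) for b in (2, 0x7F) for c in (3, 0xE5))
```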

However, erasure codes do not require the involutory property. I am also working with large input/output data buffers, so splitting the GF matrix into a large bitmatrix and optimizing individual XOR operations is practical. That lets me come up with matrices like this:

GF(128) poly = 0x41

[

01 01 01

41 20 01

01 03 41

]

{

1000000 1000000 1000000

0100000 0100000 0100000

0010000 0010000 0010000

0001000 0001000 0001000

0000100 0000100 0000100

0000010 0000010 0000010

0000001 0000001 0000001

1000001 0000010 1000000

1000000 0000001 0100000

0100000 1100000 0010000

0010000 0110000 0001000

0001000 0011000 0000100

0000100 0001100 0000010

0000010 0000110 0000001

1000000 1100000 1000001

0100000 0110000 1000000

0010000 0011000 0100000

0001000 0001100 0010000

0000100 0000110 0001000

0000010 0000011 0000100

0000001 1100001 0000010

}


This matrix requires just 7.28571 XORs of input data to multiply.

I would evaluate the middle column first to avoid allocating any temporary space for reusing data.

Note that no GF multiplications are required, and it's close to just XORing the inputs together (which would be 6 XORs). Since the first row is all ones, the input data can be XORed together in bulk, which runs at something like 4-8 GB/s and is essentially free.
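The naive XOR count behind figures like this is easy to compute from the bitmatrix alone: each bit-row with k ones costs k - 1 piece XORs, and dividing by B converts the total into whole-input-buffer units. A small sketch (my own helper, taking row strings like the listing above); reuse of repeated sub-sums can then push the cost below this naive number:

```python
# Naive XOR cost of a bitmatrix, in whole-input-buffer units.
# Each bit-row with k ones needs k - 1 XORs of 1/B-sized pieces.
def naive_cost(rows, B):
    xors = sum(row.count('1') - 1 for row in rows)
    return xors / B

# The identity bitmatrix is free: every row has exactly one 1.
assert naive_cost(['100', '010', '001'], 3) == 0.0
```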

Example 4x4 matrix

I searched for a 4x4 GF MDS matrix useful for erasure codes with the smallest number of XORs required to multiply by it. This would be used by the erasure code encoder to produce 4 output symbols, which could then be used in place of the input data.

I produced random 4x4 Cauchy matrices. I forced the first row to be all ones since that gives a super-fast shortcut to generate the first output buffer (just XOR all inputs).
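That construction can be sketched as follows. This assumes the standard primitive polynomial x^5 + x^2 + 1 (0x25) rather than reproducing the 0x12 convention below, and the helper names are mine. The key step is scaling each column by the inverse of its first entry: column scaling preserves the MDS property and forces the all-ones first row.

```python
B, POLY = 5, 0x25  # GF(32) with x^5 + x^2 + 1 -- an assumed polynomial

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo POLY.
    r = 0
    for i in range(B):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * B - 2, B - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - B)
    return r

def gf_inv(a):
    # Brute force is fine for a 32-element field.
    return next(b for b in range(1, 1 << B) if gf_mul(a, b) == 1)

def cauchy_ones_row(xs, ys):
    # Cauchy matrix m[i][j] = 1 / (xs[i] + ys[j]); xs and ys disjoint.
    m = [[gf_inv(x ^ y) for y in ys] for x in xs]
    # Scale each column so the first row becomes all ones.
    scale = [gf_inv(v) for v in m[0]]
    return [[gf_mul(v, s) for v, s in zip(row, scale)] for row in m]

m = cauchy_ones_row([0, 1, 2, 3], [4, 5, 6, 7])
assert m[0] == [1, 1, 1, 1]
```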

It turned out that a Galois field with 5 bits had the minimum cost. The field is generated by polynomial 0x12:

[

01 01 01 01

04 08 12 01

10 01 16 02

08 01 12 10

]

{

10000 10000 10000 10000

01000 01000 01000 01000

00100 00100 00100 00100

00010 00010 00010 00010

00001 00001 00001 00001

00100 00010 01001 10000

00010 00001 10000 01000

00001 10100 01000 00100

10100 01010 00100 00010

01010 00101 00010 00001

00001 10000 01101 01000

10100 01000 10010 00100

01010 00100 01001 00010

00101 00010 10000 00001

10110 00001 01000 10100

00010 10000 01001 00001

00001 01000 10000 10100

10100 00100 01000 01010

01010 00010 00100 00101

00101 00001 00010 10110

}


Note that columns can be rearranged and it will still be an MDS matrix, so it's possible to construct a very fast matrix with just 3 columns by sorting the columns by complexity and dropping the most expensive one.

There is a path through the above that only requires 14.8 XORs of input.

The MDS-AES matrix from Sajadieh is:

[

01 05 12 17

05 01 17 12

12 17 01 05

17 12 05 01

] (hex)

T1 = 5(x2 + x4)

T2 = 0x12(x3 + x4)

T3 = 5(x1 + x3)

T4 = 0x12(x1 + x2)

y1 = x1 + T1 + T2

y2 = x2 + T3 + T2

y3 = x3 + T1 + T4

y4 = x4 + T3 + T4


This matrix requires 12 XORs and four expensive GF multiplications. My version spends just 2.8 extra XORs in place of those multiplications, and can be evaluated with less temporary space (partial blocks rather than full blocks, if anything).
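As a sanity check, the schedule above can be verified against the direct matrix product. The AES polynomial 0x11B is an assumption for concreteness (Sajadieh's paper fixes its own field); the schedule's algebra, e.g. 0x05 ^ 0x12 = 0x17, holds for any GF(256) polynomial:

```python
POLY, B = 0x11B, 8  # assumed polynomial; the check passes for any GF(256)

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo POLY.
    r = 0
    for i in range(B):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * B - 2, B - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - B)
    return r

M = [[0x01, 0x05, 0x12, 0x17],
     [0x05, 0x01, 0x17, 0x12],
     [0x12, 0x17, 0x01, 0x05],
     [0x17, 0x12, 0x05, 0x01]]

def direct(x):
    return [gf_mul(M[i][0], x[0]) ^ gf_mul(M[i][1], x[1]) ^
            gf_mul(M[i][2], x[2]) ^ gf_mul(M[i][3], x[3]) for i in range(4)]

def scheduled(x):
    x1, x2, x3, x4 = x
    t1 = gf_mul(0x05, x2 ^ x4)   # 4 XORs inside the T's...
    t2 = gf_mul(0x12, x3 ^ x4)
    t3 = gf_mul(0x05, x1 ^ x3)
    t4 = gf_mul(0x12, x1 ^ x2)
    # ...plus 8 XORs to combine: 12 XORs and 4 multiplications total.
    return [x1 ^ t1 ^ t2, x2 ^ t3 ^ t2, x3 ^ t1 ^ t4, x4 ^ t3 ^ t4]

assert direct([0x11, 0x22, 0x33, 0x44]) == scheduled([0x11, 0x22, 0x33, 0x44])
```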

Conclusions

The main point here is that the involutory property desired for MDS-AES matrices hinders their performance when they are used for erasure codes.

I also wanted to draw attention to the real optimization problem, which is different: how to reduce the number of intermediate XOR sums that must be computed to arrive at the product of the multiplication.

One interesting possibility that I've neglected: for larger matrices, it may be possible to reuse temporary products like T2 = 0x12(x3 + x4) in more structured matrices. I've assumed that reuse is only possible within a GF matrix column, but useful intermediate products might span columns in a more structured matrix.

It is possible there are slightly faster matrices, because I didn't search the entire (NP-hard) problem space. I also neglected the possibility of transposing the bitmatrices.

last edit by catid, 10:43pm Tue. Nov 25th 2014 PST