In the past, the checksums I've dealt with have always had a power-of-two number of symbols (2^5, 2^32, and so on), so the go-to choice has been a CRC. Recently I had reason to look at a symbol count that isn't a power of two.

First I resisted and implemented a CRC with 5 bits:

static const char CRC5_TAB[32] = {
    0, 11, 22, 29, 5, 14, 19, 24, 10, 1, 28, 23, 15, 4, 25, 18,
    20, 31, 2, 9, 17, 26, 7, 12, 30, 21, 8, 3, 27, 16, 13, 6
};

int crc5(const unsigned char* data, int len)
{
    int sum = CRC5_TAB[data[0] & 31];
    for (int i = 1; i < len; ++i)
        sum = CRC5_TAB[sum ^ (data[i] & 31)];
    return sum;
}

This is based on the USB CRC polynomial, and it reverses the bit order of the input to simplify the function as much as possible. Simplicity was a design goal.

Then I relented and worked on a base-34 CRC. Since it wasn't a power of two, I wasn't sure what to do, so I developed this one:

// Catches all single character errors.
// Catches all transposition errors except 0 and 33 (Z).
static const char CRC34_TAB[34] = {
    23, 19, 10, 3, 15, 32, 27, 4, 6, 1, 24, 5, 17, 14, 31, 0, 12,
    26, 25, 21, 33, 16, 13, 11, 2, 29, 8, 30, 9, 28, 20, 7, 18, 22,
};

int crc34(const char* data, int len)
{
    int sum = data[0];
    for (int i = 1; i < len; ++i)
        sum = (CRC34_TAB[sum] + data[i]) % 34;
    return sum;
}

I developed the function first and then searched the 34! possible table permutations. What I found is that with this function it's easy to cover all single character errors, and all but one transposition error! Moreover, I can quickly design the table to choose which transposition error it tolerates, like 3 <-> 27 etc. I chose 0 <-> 33.

Doing some more research I found these:

http://en.wikipedia.org/wiki/Damm_algorithm

The Damm algorithm uses a quasigroup operation table, which would be 34 x 34 elements here. Unfortunately that table is too big for my use, but it has the advantage of catching all adjacent transposition errors.

http://en.wikipedia.org/wiki/Verhoeff_algorithm

The Verhoeff algorithm page has some great statistics on common error types. In particular, Verhoeff found that about 90% of errors are either single-digit errors or transposition errors, and the other ten common error types account for roughly 1% each.

Verhoeff's algorithm is similarly over-complicated, so I didn't go with it either.

The main thing I learned is that focusing on transposition errors is a good way to optimize the table; catching the rarer error types requires a lot of complexity for greatly diminished returns.

For perspective, a random base-34 hash would provide about 97% error detection (it misses 1 time in 34). My transposition-optimized table provides 99.8% detection of transposition errors, 100% detection of single character errors (by far the most common type), and 97% for everything else. Not bad at all.. :)