Code Poetry and Text Adventures

by catid, posted 8:30pm Thu. Dec 4th 2014 PST
In the past when I've looked at checksums, they've always had a power-of-two number of symbols, like 2^5, 2^32, etc., so the go-to choice has been a CRC.  Recently I had reason to look at symbol counts other than a power of two.

At first I resisted and implemented a 5-bit CRC:

// 5-bit CRC lookup table based on the USB CRC-5 polynomial,
// with the input bit order reversed to keep the update step simple.
static const char CRC5_TAB[32] = {
   0, 11, 22, 29, 5, 14, 19, 24, 10, 1, 28, 23, 15, 4, 25, 18,
   20, 31, 2, 9, 17, 26, 7, 12, 30, 21, 8, 3, 27, 16, 13, 6
};

// Each input byte carries one 5-bit symbol in its low bits.
int crc5(const unsigned char* data, int len)
{
   int sum = CRC5_TAB[data[0] & 31];

   for (int i = 1; i < len; ++i)
      sum = CRC5_TAB[sum ^ (data[i] & 31)];

   return sum;
}

This is based on the USB CRC-5 polynomial, with the bit order of the input reversed to keep the function as simple as possible.  Simplicity was a design goal.
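To make the intended use concrete, here's a quick sketch of how the check symbol might travel with a message.  The message values are made up for illustration, and it assumes the crc5() above is in the same file:

#include <stdio.h>

int crc5(const unsigned char* data, int len); // defined above

int main(void)
{
   // A made-up message of 5-bit symbols (values 0..31).
   unsigned char msg[] = { 7, 21, 3, 30 };

   // Sender computes the check symbol and sends it along.
   int check = crc5(msg, 4);

   // Receiver recomputes over the received symbols and compares.
   printf("check = %d, ok = %d\n", check, crc5(msg, 4) == check);
   return 0;
}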

Then I relented and worked on a base-34 CRC.  Since it wasn't a power of two, I wasn't sure what to do, so I developed this one:

// Permutation table found by search (see below).
// Catches all single character errors.
// Catches all transposition errors except 0 and 33 (Z).
static const char CRC34_TAB[34] = {
   23, 19, 10, 3, 15, 32, 27, 4, 6, 1, 24, 5, 17, 14, 31, 0, 12,
   26, 25, 21, 33, 16, 13, 11, 2, 29, 8, 30, 9, 28, 20, 7, 18, 22,
};

// Input symbols are raw values 0..33, not ASCII characters.
int crc34(const char* data, int len)
{
   int sum = data[0];

   for (int i = 1; i < len; ++i)
      sum = (CRC34_TAB[sum] + data[i]) % 34;

   return sum;
}

I developed the function first and then searched among the 34! possible tables for a good permutation.  What I found is that for this function it's easy to cover all single character errors (any permutation table does, since each update step maps different sums to different sums), and all but one transposition error!  Moreover, I can quickly design the table to choose which transposition error I want to tolerate, like 3 <-> 27, etc.  I chose 0 <-> 33.
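As a sanity check, the transposition claim can be brute-forced in a few lines.  This is just a sketch of the verification, not my search code: for a two-symbol message "a b" the checksum is (CRC34_TAB[a] + b) % 34, so a swap of a and b slips through exactly when CRC34_TAB[a] - a and CRC34_TAB[b] - b are congruent mod 34.  Running this against the table above should print only the 0 <-> 33 pair:

#include <stdio.h>

static const char TAB[34] = {
   23, 19, 10, 3, 15, 32, 27, 4, 6, 1, 24, 5, 17, 14, 31, 0, 12,
   26, 25, 21, 33, 16, 13, 11, 2, 29, 8, 30, 9, 28, 20, 7, 18, 22,
};

int main(void)
{
   // A swap of leading symbols a and b goes undetected iff
   // (TAB[a] + b) % 34 == (TAB[b] + a) % 34.
   for (int a = 0; a < 34; ++a)
      for (int b = a + 1; b < 34; ++b)
         if ((TAB[a] + b) % 34 == (TAB[b] + a) % 34)
            printf("undetected swap: %d <-> %d\n", a, b);

   return 0;
}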

Doing some more research I found these:

http://en.wikipedia.org/wiki/Damm_algorithm

The Damm algorithm uses a quasigroup lookup table (a Latin square with an anti-symmetry property) that would be 34 x 34 elements at base 34.  That table is too big for me to use, unfortunately.  But it has the advantage of not tolerating any transposition errors.
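For reference, here's roughly what the Damm update loop looks like at base 10, using the quasigroup table from that Wikipedia page; a base-34 version would need a 34 x 34 table with the same anti-symmetry property:

#include <stdio.h>

// Weakly totally anti-symmetric quasigroup table for base 10,
// copied from the Wikipedia page on the Damm algorithm.
static const char DAMM10[10][10] = {
   { 0, 3, 1, 7, 5, 9, 8, 6, 4, 2 },
   { 7, 0, 9, 2, 1, 5, 4, 8, 6, 3 },
   { 4, 2, 0, 6, 8, 7, 1, 3, 5, 9 },
   { 1, 7, 5, 0, 9, 8, 3, 4, 2, 6 },
   { 6, 1, 2, 3, 0, 4, 5, 9, 7, 8 },
   { 3, 6, 7, 4, 2, 0, 9, 5, 8, 1 },
   { 5, 8, 6, 9, 7, 2, 0, 1, 3, 4 },
   { 8, 9, 4, 5, 3, 6, 2, 0, 1, 7 },
   { 9, 4, 3, 8, 6, 1, 7, 2, 0, 5 },
   { 2, 5, 8, 1, 4, 3, 6, 7, 9, 0 },
};

// Returns the Damm check digit for a string of digit values 0..9.
// Appending the returned digit makes the whole string check to 0.
int damm10(const char* digits, int len)
{
   int interim = 0;

   for (int i = 0; i < len; ++i)
      interim = DAMM10[interim][(int)digits[i]];

   return interim;
}

int main(void)
{
   char digits[] = { 5, 7, 2 }; // Wikipedia's worked example: 572 -> 4

   printf("check digit = %d\n", damm10(digits, 3));
   return 0;
}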

http://en.wikipedia.org/wiki/Verhoeff_algorithm

The Verhoeff algorithm page has some great statistics on common error types.  In particular, Verhoeff found that about 90% of errors are either single-digit errors or transpositions of adjacent digits.  The other ten common error types account for about 1% of errors each.

Verhoeff's algorithm is also more complicated than I need, so I didn't go with it either.

Mainly what I learned is that focusing on transposition errors turns out to be a good way to optimize the tables, and catching the rarer error types requires a lot of complexity for greatly diminished returns.

For perspective, a random base-34 hash would provide about 97% error detection (it misses only when the checksum collides by chance: 1 - 1/34 ≈ 97.1%).  My transposition optimization catches 560 of the 561 possible transposition pairs, or about 99.8%, along with 100% of single character errors (by far the most common type) and roughly 97% of everything else.  Not bad at all.. :)