Jack
Jack

Reputation: 16724

A hash function like from K&R book

Consider this function:

unsigned hash(char *s)
{
  char *p;
  unsigned hashval;
  for(p = s; *p; p++)
    hashval = *p + 31 * hashval;
  return hashval;
}

How can I measure how many bytes in s will returns a wrong result,such as overflow? I'm on 32-bit platform.

Upvotes: 0

Views: 2213

Answers (2)

zwol
zwol

Reputation: 140569

If you change it to read

unsigned hash(const char *s)
{
  const unsigned char *p;
  unsigned hashval = 0;
  for (p = (const unsigned char *) s; *p; p++)
    hashval = *p + 31u * hashval;
  return hashval;
}

then there is no longer any possibility of undefined behavior due to integer overflow, because all types involved in arithmetic are unsigned, so everything wraps mod 2n (where n is the width of unsigned in bits). I have also fixed the use of an uninitialized variable, and made s and p const, which may improve optimization and/or catch mistakes in the body of the function.

(I'm not remembering the exact arithmetic conversion rules right now; it might not have been possible in the first place. However, writing it this way makes it obviously impossible.)

BTW, there are much better hash functions known nowadays: if you don't have a strong reason to do otherwise I recommend using SipHash.

Upvotes: 5

Floris
Floris

Reputation: 46375

A couple of thoughts:

First, overflow is expected in a hash function.

Second, since your function includes a 31*hashval, and every element in string must have at least a value of 1, you would expect that the longest string you can have before you hit overflow is a string of all \x01, and it will overflow the hash when it gets to a length of 6 (since the *31 operation distributes the entire number over the 5 bits to the left, there is going to be carry, meaning you're likely to affect the sixth bit, and 6*6 = 36 > 32). The number will be less when the bytes are larger (the first byte pretty much defines the behavior - when it is large you may get overflow after five bytes). It is easier to show this with the real bits and bytes. I'm going to use a *32 rather than a *31 algorithm (not quite right, but less carry to worry about, al you'll get the idea):

byte      hash is less than:
0000a000  00000000 00000000 00000000 0000a000
10000000  00000000 00000000 000000a0 10000000
b0000000  00000000 00000000 a0100000 b0000000
c0000000  00000000 00a01000 00b00000 c0000000
d0000000  0000a010 0000b000 00c00000 d0000000
anything  OVERFLOW!

As was pointed out above, you can improve the predictable behavior of your (rather poor) hashing algorithm by declaring everything as unsigned integer; I would also recommend initializing the hash (and a value other than zero might be a good idea), rather than assuming that the compiler will set it to zero (I am not 100% sure that's defined behavior). Finally, if you are wondering about overflow, and want to get a warning, I would modify the code as follows:

for(p = s; *p; p++) {
    if((hashval > 0xFFFFFFFF/31) || (*p>>1 + 31 * (hashval>>1)) > 0x7FFFFFFF) {
        printf("hash is about to overflow at character %c\n", *p);
    }
    hashval = *p + 31 * hashval;
}

Upvotes: 3

Related Questions