phatfingers
phatfingers

Reputation: 10250

Is there a way to adjust for integer overflow?

I'm noodling through an anagram hash function, already solved several different ways, but I'm looking for extreme performance as an exercise. I already submitted a solution that passed all the given tests (beating out 100% of all competitors by at least 1ms), but I believe that although it "won", it has a weakness that just wasn't triggered. It is subject to integer overflow in a way that could affect the results.

The gist of the solution was to combine multiple commutative operations, each taking some number of bits, and concatenate them into one long variable. I chose xor, sum, and product. The xor operation cleanly fits within a fixed number of bits. The sum operation might overflow, but because of the way overflow is addressed, it would still arrive at the same result if letters and their corresponding values are rearranged. I wouldn't worry, for example, about whether this function would overflow.

private short sumHash(String s) {
    short hash=0;
    for (char c:s.toCharArray()) {
        hash+=c;
    }
    return hash;
}

Where I run into trouble is in the inclusion of products. If I make a function that returns the product of a list of values (such as character values in a String), then, at the very least, the result could be rendered inaccurate if the product overflowed to exactly zero.

private short productHash(String s) {
    short hash=1;
    for (char c:s.toCharArray()) {
        hash*=c;
    }
    return hash;
}

Is there any safe and performant way to avoid this weakness so that the function gains the benefit of the commutative property of multiplication to produce the same value for anagrams, but can't ever encounter a product that overflows to zero?

Upvotes: 0

Views: 177

Answers (3)

meriton
meriton

Reputation: 70564

You will only hit zero if

a * b = 0 mod 2^64

which is equivalent to there being an integer k such that

a * b = k * 2^64 

That is, we get in trouble if factors divide 2^64, i.e. if factors are even. Therefore, the easiest solution is ensuring that all factors are odd, for instance like this:

for (char ch : chars) {
  hash *= (ch << 1) | 1;
}

This allows you to keep 63 bits of information.

Note however that this technique will only avoid collisions caused by overflow, but not collisions caused by multipliers that share a common factor. If you wish to avoid that, too, you'll need coprime multipliers, which is easiest achieved if they are prime.

Upvotes: 2

Code-Apprentice
Code-Apprentice

Reputation: 83527

The naive way to avoid overflow, is to use a larger type such as int or long. However, for your purposes, modulo arithmetic might make more sense. You can do (a * b) % p for a prime p to maintain commutativity. (There is some deep mathematics here called Group Theory, if you are interested in learning more.) You will need to limit p to be small enough that each a * b does not overflow. The easiest way to do this is to pick a p so that (p - 1)^2 can still be represented in a short or whatever data type you are using.

Upvotes: 1

Louis Wasserman
Louis Wasserman

Reputation: 198083

Sure, if you're willing to go to some lengths to do it. The simplest solution that occurs to me is to write

hash *= primes[c];

where primes is an array that maps each possible character to a distinct odd prime. Overflowing to zero can only happen if the "true" product in infinite-precision arithmetic is a multiple of 2^32, and if you're multiplying by odd primes, that's impossible.

(You do run into the problem that the hash itself will always be odd, but you could shift it right one bit to obtain a more fully mixed hash.)

Upvotes: 3

Related Questions