Dom
Dom

Reputation: 3100

Fast implementation of rolling hash in PHP

I've implemented the Adler32 rolling hash in PHP, but because ord is so slow (about 1MB per second on my dev machine) to get the integer values of chanters in a string, this solution is unworkable for 100MB+ files.

PHP's mhash function can get a very quick calculation for the adler32 (120MB per second on my dev machine). However mhash doesn’t seem to support the rolling nature of adler32, so you have to calculate a whole new adler32 as the rolling window moves rather than just recalculate the hash for the two bytes which have actually changed.

I'm not tied to the adler32 algorithm, I just need a very fast rolling hash in PHP.

Upvotes: 0

Views: 518

Answers (2)

池下克彦
池下克彦

Reputation: 11

With unpack you can get an array of integer values of characters as an array. Note that the index starts from 1, not 0.

Example:

$contents = "addadda";
$ords = array_values(unpack("C*", $contents)); // make 0-based array 
$a = 1; $b = 0; // hash low and high words
$len = 4; // the window length
foreach ($ords as $i => $ord) {
    if ($i < $len) {
        $a = ($a + $ord) % 65521;
        $b = ($b + $a) % 65521;
    } else {
        $removed = $ords[$i - $len];
        $a = ($a + $ord - $removed + 65521) % 65521;
        $b = ($b + $a - 1 - $len * $removed + 65521) % 65521;
    }
    if ($i >= $len - 1) {
        echo $i - $len + 1, "..", $i, ": ",
            substr($contents, $i - $len + 1, $len), " => ",
            $b * 65536 + $a, "\n";
    }
}

Result:

0..3: adda => 64815499
1..4: ddad => 65405326
2..5: dadd => 65208718
3..6: adda => 64815499

Upvotes: 0

Mark Adler
Mark Adler

Reputation: 112547

Call the low two bytes of the Adler-32 A and the high two bytes B, where that is the Adler-32 of the sequence {x1, x2, ..., xn}.

To get the Adler-32 of {x2, ..., xn}, subtract x1 from A, modulo 65521, and subtract n * x1 + 1 from B, again modulo 65521.

Note that if your window size n happens to be a multiple of 65521, then you can just subtract one from B (modulo 65521). So that might be a good window size to pick, if you can. Also note that if n is larger than 65521, then you can multiply x1 by (n modulo 65521) instead. If n is a constant, you can do that modulo operation ahead of time.

(Note that % operator in C and PHP is not the modulo operation, but rather the remainder operation. So you need to take care with negative numbers.)

Upvotes: 1

Related Questions