michaelliu

Reputation: 1697

Murmur3 hash different result between Python and Java implementation

I have two programs, one in Python and one in Java, that need to hash the same string using Murmur3.

Python version 2.7.9:

import mmh3
mmh3.hash128('abc')

Gives 79267961763742113019008347020647561319L.

Java, using Guava 18.0:

HashCode hashCode = Hashing.murmur3_128().newHasher().putString("abc", StandardCharsets.UTF_8).hash();

This gives the string "6778ad3f3f3f96b4522dca264174a23b"; converting it to a BigInteger gives 137537073056680613988840834069010096699.

How can I get the same result from both?

Thanks

Upvotes: 5

Views: 7896

Answers (2)

liiight

Reputation: 366

If anyone is interested in the reverse direction, converting the Python output to the Java output (Python 3):

import mmh3

# hash_bytes returns the raw 16-byte digest rather than a single integer
murmur = mmh3.hash_bytes('abc')

# Print each byte as two lowercase hex digits, in the order the bytes are returned
print(''.join(f'{byte:02x}' for byte in murmur))
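If you only have the integer from mmh3.hash128 (as in the question), the same hex string can be sketched with the standard library alone. This assumes Python 3 and that the integer is the unsigned 128-bit value; the function name hash128_to_guava_hex is made up for illustration.

```python
def hash128_to_guava_hex(value):
    """Render an unsigned 128-bit integer as 32 hex digits, least significant byte first."""
    # 'little' emits the least significant byte first, which is the byte
    # order the Java hex string in the question uses.
    return value.to_bytes(16, 'little').hex()


# Integer reported by mmh3.hash128('abc') in the question
print(hash128_to_guava_hex(79267961763742113019008347020647561319))
```

For the value in the question, this should print the hex string that Guava returned there.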

Upvotes: 6

ColinD

Reputation: 110104

Here's how to get the same result from both:

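// asBytes() returns the hash bytes in little-endian order
byte[] mm3_le = Hashing.murmur3_128().hashString("abc", UTF_8).asBytes();
// Reverse them, because BigInteger expects big-endian input
byte[] mm3_be = Bytes.toArray(Lists.reverse(Bytes.asList(mm3_le)));
// Signum 1 keeps the value non-negative even when the top bit is set
assertEquals("79267961763742113019008347020647561319",
    new BigInteger(1, mm3_be).toString());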

The hash code's bytes need to be treated as little endian, but BigInteger interprets bytes as big endian. You were presumably using new BigInteger(hex, 16) to create the BigInteger, but the output of HashCode.toString() is actually a series of pairs of hexadecimal digits representing the hash bytes in the same order they're returned by asBytes() (little endian). You can also reverse the order of those pairs to get a hex number that does produce the same result when passed to new BigInteger(reversedHex, 16).
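To see the two interpretations side by side without Guava or mmh3, here is a small Python 3 sketch that starts from the hex string in the question. The variable names are illustrative only; it uses nothing beyond the standard library.

```python
# HashCode.toString() output from the question: one pair of hex digits per byte
guava_hex = "6778ad3f3f3f96b4522dca264174a23b"
digest = bytes.fromhex(guava_hex)

# Reading the bytes as big endian is what new BigInteger(hex, 16) does
as_big_endian = int.from_bytes(digest, 'big')

# Reading them as little endian matches the number mmh3.hash128 reports
as_little_endian = int.from_bytes(digest, 'little')

# Reversing the byte pairs and parsing as ordinary hex gives the same number
reversed_hex = digest[::-1].hex()
from_reversed_hex = int(reversed_hex, 16)

print(as_big_endian)
print(as_little_endian)
print(from_reversed_hex == as_little_endian)
```

The first number printed should be the Java-side value from the question and the second the Python-side value.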

I think the documentation of toString() is somewhat confusing because of the way it refers to "big endian"; it doesn't actually mean that the output of the method is the hexadecimal number representing the bytes interpreted as big endian.

We have an open issue for adding asBigInteger() to HashCode.

Upvotes: 10
