Masti

Reputation: 151

Murmurhash3 between Java and C++ is not aligning

I have two separate applications, one in Java and the other in C++. I am using MurmurHash3 in both. However, for the same string I get a different result in C++ than in Java.

Here is the one from C++: https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp?r=144

I am using the following function:

void MurmurHash3_x86_32 ( const void * key, int len,
                      uint32_t seed, void * out )

Here is the one for Java: http://search-hadoop.com/c/HBase:hbase-common/src/main/java/org/apache/hadoop/hbase/util/MurmurHash3.java||server+void+%2522hash

There are many versions of this Java code in circulation.

This is how I am making a call for Java:

String s = new String("b2622f5e1310a0aa14b7f957fe4246fa");
System.out.println(MurmurHash3.murmurhash3_x86_32(s.getBytes(), 0, s.length(), 2147368987));

The output I get from Java: -1868221715

The output I get from C++: 3297211900

When I tested for some other sample strings like "7c6c5be91430a56187060e06fd64dcb8" and "7e7e5f2613d0a2a8c591f101fe8c7351" they match in Java and C++.

Any pointers are appreciated

Upvotes: 0

Views: 2775

Answers (2)

joy

Reputation: 11

I had the same problem as you, although my Java version of MurmurHash3 is different from yours. After making some changes to the C++ version of MurmurHash3, I got the two versions to generate the same hash values. Here is my solution, which you can use to check whether it also works for you.

Probably the biggest difference between the Java and C++ versions lies in the right-shift operation: Java has both >> and >>>, while C++ has only >>. All integers in Java are signed, whereas C++ offers both signed and unsigned integers. In Java, >> is an arithmetic right shift and >>> is a logical right shift. In C++, >> is a logical shift on unsigned types; on negative signed values it is an arithmetic shift on mainstream compilers (this was implementation-defined before C++20). The original C++ version of MurmurHash3 uses unsigned integers, so to produce negative hash values as Java does, I first changed every unsigned uint32_t to the signed int32_t. Then I located each >>> in the Java code and adjusted the corresponding >> in C++ so that it still shifts logically. Concretely, I changed from:

inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
  return (x << r) | (x >> (32 - r));
}

to:

inline int32_t rotl32 ( int32_t x, int8_t r )
{
  return (int32_t)(((uint32_t)x << r) | ((uint32_t)x >> (32 - r))); // shift as unsigned, similar to << and >>> in Java
}

and from:

FORCE_INLINE uint32_t fmix32 ( uint32_t h )
{
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;

  return h;
}

to:

FORCE_INLINE int32_t fmix32 ( int32_t h )
{
  h ^= (int32_t)((uint32_t)h >> 16); // similar to >>> in Java
  h *= 0x85ebca6b;
  h ^= (int32_t)((uint32_t)h >> 13);
  h *= 0xc2b2ae35;
  h ^= (int32_t)((uint32_t)h >> 16);

  return h;
}

With these changes, my Java and C++ versions of MurmurHash3 generate the same hash value.
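To make the shift difference concrete, here is a minimal Java sketch of the two helpers, written with >>> where the C++ code relies on unsigned >>. The class name ShiftDemo is hypothetical, and the bodies simply mirror the C++ rotl32 and fmix32 quoted above; they are not taken from any particular Java port.

```java
final class ShiftDemo {
    // Rotate left: the bits that fall off the top must re-enter at the bottom
    // zero-filled, so the right shift has to be the logical one (>>>).
    static int rotl32(int x, int r) {
        return (x << r) | (x >>> (32 - r));
    }

    // Finalization mix with the same constants as the C++ fmix32.
    // Java int multiplication wraps modulo 2^32, like uint32_t in C++.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        int x = 0x80000000;             // only the sign bit set
        System.out.println(x >> 31);    // arithmetic shift copies the sign bit: -1
        System.out.println(x >>> 31);   // logical shift fills with zeros: 1
        System.out.println(rotl32(x, 1) == Integer.rotateLeft(x, 1)); // true
    }
}
```

If a Java port uses >> in one of these places, it will diverge from the C++ code only for inputs whose intermediate values have the top bit set, which would fit the pattern of some strings matching and others not.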

Upvotes: 1

David Conrad

Reputation: 16399

There are two problems I can see. First, C++ is using uint32_t, and giving you a value of 3,297,211,900. This number is larger than can fit in a signed 32-bit int, and Java uses only signed integers. However, -1,868,221,715 is not equal to 3,297,211,900, even accounting for the difference between signed and unsigned ints.

(In Java 8 they have added Integer.toUnsignedString(int), which will convert a signed 32-bit int to its unsigned string representation. In earlier versions of Java, you can cast the int to a long and then mask off the high bits: ((long) i) & 0xffffffffL.)
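As a quick check of the arithmetic, the two outputs from the question can be converted between their signed and unsigned views. This is only a sketch; the class name UnsignedView and its helper methods are made up for illustration.

```java
final class UnsignedView {
    // Pre-Java 8 idiom: widen to long, then mask off the sign-extended high bits.
    static long toUnsigned(int i) {
        return ((long) i) & 0xffffffffL;
    }

    // Reinterpret an unsigned 32-bit value (held in a long) as a signed int.
    static int toSigned(long u) {
        return (int) u;
    }

    public static void main(String[] args) {
        int javaHash = -1868221715;      // the Java output from the question
        long cppHash = 3297211900L;      // the C++ output from the question

        System.out.println(Integer.toUnsignedString(javaHash)); // 2426745581
        System.out.println(toUnsigned(javaHash));               // 2426745581
        System.out.println(toSigned(cppHash));                  // -997755396
    }
}
```

Neither conversion makes the two values agree, so the mismatch is not just a matter of how the same 32 bits are printed.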

The second problem is that you are using the wrong version of getBytes(). The one that takes no argument converts a Unicode String to a byte[] using the default platform encoding, which may vary depending on how your system is set up. It could be giving you UTF-8, Latin1, Windows-1252, KOI8-R, Shift-JIS, EBCDIC, etc.

Never, ever, ever call the no arguments version of String.getBytes(), under any circumstances. It should be deprecated, decimated, defenestrated, destroyed, and deleted.

Use s.getBytes("UTF-8") (or whatever encoding you're expecting to get) instead.
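A minimal sketch of the call with an explicit encoding follows. The class name ExplicitEncoding and the helper utf8 are hypothetical; StandardCharsets.UTF_8 (available since Java 7) avoids the checked exception that the string-named overload throws. I am also assuming the length argument of the hash function is meant to be a byte count, in which case it should come from the byte array rather than from s.length().

```java
import java.nio.charset.StandardCharsets;

final class ExplicitEncoding {
    // Always name the charset; the result no longer depends on the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "b2622f5e1310a0aa14b7f957fe4246fa";
        byte[] data = utf8(s);

        // For an ASCII-only string the byte count equals the char count.
        System.out.println(data.length);            // 32

        // For anything else they differ: U+00E9 is 1 char but 2 bytes in UTF-8.
        String t = "\u00e9";
        System.out.println(t.length() + " char, " + utf8(t).length + " bytes");

        // The hash call from the question would then become (not executed here):
        // MurmurHash3.murmurhash3_x86_32(data, 0, data.length, 2147368987);
    }
}
```

For the hex strings in the question both lengths are 32, so this alone may not explain the mismatch, but it removes one source of platform-dependent behavior.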

As the Zen of Python says, "Explicit is better than implicit."

I can't tell if there may be any other problems beyond these two.

Upvotes: 2
