Charles

Reputation: 495

How to set the seed value for Ruby's murmur hash

Is there a way to set the seed value for Ruby's built-in hash function (i.e. MurmurHash in 1.9; I don't know about JRuby) so that I get the same hash code every time I run the script, e.g. in parallel on multiple processes or on different nodes?

In other words, I want

puts "this is a test".hash

to print the same value whenever I run it: today, tomorrow, three weeks from now, etc.

I want to do this so I can implement MinHash in parallel.

I can see that the hash function in the murmur_hash gem accepts a seed, so I assume I can set the seed and get the same hash code deterministically whenever I choose the same seed.

Upvotes: 2

Views: 2464

Answers (2)

ming

Reputation: 31

Try this seed: 0xbc9f1d34, from Jeff Dean's LevelDB source code. :)

Upvotes: 3

cevaris

Reputation: 5794

Reviving this in case anyone wants to know...

You can use the murmurhash3 gem.

With it, you can override the hash method built into the String class.

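require 'murmurhash3'

class String
  # Fixed seed: every process that uses the same value gets the same hashes.
  SEED = 12345678

  # Replace the built-in String#hash (seeded randomly per process)
  # with a 32-bit MurmurHash3 that uses the fixed seed above.
  def hash
    MurmurHash3::V32.str_hash(self, SEED)
  end
end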

Now you can use this hash function on any string.

$ irb
2.1.1 :001 > "this is a test".hash
=> 553036434 

Assuming you use the same seed (12345678), you should get the same hash every time, on any server, process, or thread.
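If you would rather not reopen String or depend on a gem, the same idea can be sketched in pure Ruby. The module below is a minimal, unoptimized take on the 32-bit x86 variant of MurmurHash3 with an explicit seed. The names SeededMurmur and hash32 are made up for this example and are not part of any gem, and I make no promise that it matches the gem's output for every input or encoding.

```ruby
# Pure-Ruby sketch of MurmurHash3 (x86, 32-bit) with an explicit seed.
# SeededMurmur and hash32 are hypothetical names chosen for this example.
module SeededMurmur
  MASK32 = 0xffffffff
  C1 = 0xcc9e2d51
  C2 = 0x1b873593

  module_function

  # Rotate a 32-bit value left by r bits.
  def rotl32(x, r)
    ((x << r) | (x >> (32 - r))) & MASK32
  end

  # Scramble one 32-bit block before it is mixed into the state.
  def mix_k(k)
    k = (k * C1) & MASK32
    k = rotl32(k, 15)
    (k * C2) & MASK32
  end

  def hash32(str, seed = 0)
    bytes = str.b                      # binary copy, so we hash raw bytes
    h = seed & MASK32
    full = bytes.bytesize / 4 * 4      # length of the 4-byte-aligned body

    # Body: consume little-endian 32-bit blocks.
    bytes.byteslice(0, full).unpack('V*').each do |k|
      h ^= mix_k(k)
      h = rotl32(h, 13)
      h = (h * 5 + 0xe6546b64) & MASK32
    end

    # Tail: the remaining 0..3 bytes, packed little-endian.
    tail = bytes.byteslice(full, 3).bytes
    unless tail.empty?
      k = 0
      tail.each_with_index { |b, i| k |= b << (8 * i) }
      h ^= mix_k(k)
    end

    # Finalization: mix in the length and avalanche the bits.
    h ^= bytes.bytesize
    h ^= h >> 16
    h = (h * 0x85ebca6b) & MASK32
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & MASK32
    h ^ (h >> 16)
  end
end

# Same input and same seed give the same value in every process.
puts SeededMurmur.hash32('this is a test', 12345678)
```

Calling a separate method like this leaves String#hash untouched, so the seeded hash is only used where you ask for it.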

MurmurHash in Parallel

You can use the parallel gem.

Then simply pass the list of items you want to be executed/hashed in parallel.

require 'parallel'

items_to_hash = ['val0', 'val1',...., 'valN']
# Each item is hashed in a worker; results come back in input order.
results = Parallel.map(items_to_hash) do |item|
  item.hash
end
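Since the goal in the question is MinHash, the per-item work handed to a parallel map would probably be a whole signature rather than a single hash. Here is a rough sketch of that step. It assumes a family of seeded hashes built from Digest::MD5 in the standard library (a stand-in for a seeded MurmurHash, chosen only so the example runs without a gem), and NUM_HASHES, BASE_SEED, minhash_signature and estimated_jaccard are hypothetical names.

```ruby
require 'digest'

NUM_HASHES = 16          # signature length (assumed value for illustration)
BASE_SEED  = 12345678    # fixed, so every process derives the same seeds
SEEDS = Array.new(NUM_HASHES) { |i| BASE_SEED + i }

# Deterministic seeded 32-bit hash. MD5 is a stand-in for a seeded
# MurmurHash here; swap in your own seeded function if you have one.
def seeded_hash(token, seed)
  Digest::MD5.digest("#{seed}:#{token}").unpack('N').first
end

# MinHash signature: for each seed, keep the smallest hash over all tokens.
# Expects a non-empty collection of string tokens.
def minhash_signature(tokens, seeds = SEEDS)
  seeds.map { |seed| tokens.map { |t| seeded_hash(t, seed) }.min }
end

# Fraction of positions where two signatures agree, which estimates
# the Jaccard similarity of the underlying token sets.
def estimated_jaccard(sig_a, sig_b)
  sig_a.zip(sig_b).count { |x, y| x == y }.to_f / sig_a.size
end

doc_a = %w[the quick brown fox jumps]
doc_b = %w[the quick brown dog sleeps]
puts estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
```

Because the seeds are fixed constants, signatures computed on different nodes can be compared directly.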

If you are not into using another gem to run the hashes in parallel, here is an example that uses vanilla Ruby to get you going:
http://t-a-w.blogspot.com/2010/05/very-simple-parallelization-with-ruby.html
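For a rough idea of what a gem-free version can look like, here is a minimal sketch that uses only fork, pipes and Marshal from core Ruby. It is my own illustration, not the code from the linked post. fork_map and stable_hash are hypothetical names, Digest::MD5 stands in for a seeded MurmurHash so that the example needs no gem, and fork is not available on every platform (Windows, for instance).

```ruby
require 'digest'

HASH_SEED = 12345678  # same illustrative seed as above

# Deterministic seeded hash; MD5 is a stand-in for a seeded MurmurHash.
def stable_hash(str, seed = HASH_SEED)
  Digest::MD5.digest("#{seed}:#{str}").unpack('N').first
end

# Hypothetical helper: split items into slices, fork one child per slice,
# and read each child's Marshal-dumped results back through a pipe.
# Results are returned in the original input order.
def fork_map(items, workers = 4)
  return [] if items.empty?

  slice_size = (items.size.to_f / workers).ceil
  jobs = items.each_slice(slice_size).map do |slice|
    reader, writer = IO.pipe
    reader.binmode
    writer.binmode
    pid = fork do
      reader.close
      writer.write(Marshal.dump(slice.map { |item| yield item }))
      writer.close
      exit!(0)  # leave the child without running at_exit handlers
    end
    writer.close  # the parent only reads
    [pid, reader]
  end

  jobs.flat_map do |pid, reader|
    result = Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
    result
  end
end

p fork_map(%w[val0 val1 val2 val3]) { |item| stable_hash(item) }
```

Every child computes the same value for the same string because the seed is a constant, so the output does not depend on which process handled which item.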

Upvotes: 1
