Marsellus Wallace

Reputation: 18611

What value to use for numHashTables in Spark LSH by Uber?

I'm trying to use .approxSimilarityJoin from Spark MLlib's LSH (MinHash for Jaccard distance), e.g.:

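import org.apache.spark.ml.feature.MinHashLSH

// MinHash LSH estimator; numHashTables is the parameter in question
val mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("features")
    .setOutputCol("hashes")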

I understand that the higher the numHashTables, the more accurate the system, and the more complex/slow the calculation. I have two questions about the parameters:

NOTE: I believe that the algorithm has been added to MLlib by Uber: https://eng.uber.com/lsh/

Upvotes: 4

Views: 1880

Answers (1)

min fan

Reputation: 1

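I think numHashTables is just the MinHash fingerprint size, i.e. the number of hash values computed per input. It is probably an empirical parameter that depends on your use case. In the usual banding notation, with b bands of r rows each, b * r = numHashTables (currently r = 1).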
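To get a feel for the trade-off, here is a minimal sketch in plain Scala (no Spark needed) of the standard banding formula: two sets with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b. It assumes, as I said above, that numHashTables plays the role of b with r = 1; the object and method names are made up for illustration and are not part of the Spark API.

```scala
// Hypothetical helper, not part of Spark: evaluates the textbook LSH banding formula.
object LshCandidateProbability {

  /** Probability that two sets with Jaccard similarity `s` share at least one
    * band when the signature is split into `b` bands of `r` rows each:
    * P = 1 - (1 - s^r)^b.
    * Assumption: numHashTables corresponds to b, and r defaults to 1.
    */
  def candidateProbability(s: Double, b: Int, r: Int = 1): Double = {
    require(s >= 0.0 && s <= 1.0, "similarity must be in [0, 1]")
    require(b > 0 && r > 0, "b and r must be positive")
    1.0 - math.pow(1.0 - math.pow(s, r), b)
  }

  def main(args: Array[String]): Unit = {
    val s = 0.5 // Jaccard similarity of the pair, i.e. Jaccard distance 0.5
    // Print how the chance of catching this pair grows with the number of tables.
    for (b <- Seq(1, 5, 10, 20)) {
      println(f"numHashTables=$b%2d -> P(candidate) = ${candidateProbability(s, b)}%.4f")
    }
  }
}
```

Under these assumptions, more hash tables means fewer similar pairs are missed, at the cost of more hashing and more candidate pairs to check, so a reasonable way to choose the value is to pick the similarity you care about and increase numHashTables until the miss rate is acceptable for your data.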

Upvotes: 0
