What value should I use for numHashTables in Spark LSH by Uber?

Question

I'm trying to use .approxSimilarityJoin from Spark MLlib's LSH implementation (MinHash for Jaccard distance), e.g.:

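import org.apache.spark.ml.feature.MinHashLSH

// MinHash LSH estimator; "features" must be a column of non-zero vectors
val mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("features")
    .setOutputCol("hashes")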

I understand that a higher numHashTables makes the results more accurate, but also makes the computation slower and more expensive. I have two questions about this parameter:

What's the relationship between numHashTables and the MinHash fingerprint size?
How do I set the value correctly?
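
To make the accuracy/cost trade-off above concrete, here is a minimal sketch in plain Scala (no Spark needed). It rests on two assumptions that should be checked against the Spark version in use: that each hash table holds a single, independent MinHash value, and that two rows become a candidate pair when they agree in at least one table. Under those assumptions, a pair with Jaccard similarity s is a candidate with probability 1 - (1 - s)^numHashTables. The object and method names are hypothetical and are not part of the MLlib API.

```scala
object CandidateProbability {

  // Probability that two sets with Jaccard similarity `s` agree in at least
  // one of `numHashTables` tables.
  // ASSUMPTION: one independent MinHash per table, and a match in any single
  // table is enough to make the pair a candidate.
  def candidateProbability(s: Double, numHashTables: Int): Double = {
    require(s >= 0.0 && s <= 1.0, "Jaccard similarity must be in [0, 1]")
    require(numHashTables >= 1, "numHashTables must be at least 1")
    1.0 - math.pow(1.0 - s, numHashTables)
  }

  def main(args: Array[String]): Unit = {
    val similarities = Seq(0.2, 0.5, 0.8)
    val tableCounts = Seq(1, 5, 10, 20)

    // One line per table count, showing how likely pairs of different
    // similarity are to be picked up as candidates.
    for (n <- tableCounts) {
      val row = similarities
        .map(s => f"s=$s%.1f -> ${candidateProbability(s, n)}%.3f")
        .mkString(", ")
      println(f"numHashTables=$n%2d: $row")
    }
  }
}
```

If these assumptions hold, raising numHashTables increases the chance that truly similar pairs are found, but it also produces more candidate pairs among dissimilar rows, and each of those candidates costs an exact distance computation before it can be filtered out.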

NOTE: I believe that the algorithm has been added to MLlib by Uber: https://eng.uber.com/lsh/

Answers (1)
