Reputation: 18611
I'm trying to use .approxSimilarityJoin of Spark MLlib LSH (MinHash for Jaccard distance), e.g.:
import org.apache.spark.ml.feature.MinHashLSH

val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")
I understand that the higher the numHashTables, the more accurate the system, and the more complex/slow the calculation. I have two questions about the parameters:
NOTE: I believe that the algorithm has been added to MLlib by Uber: https://eng.uber.com/lsh/
Upvotes: 4
Views: 1880
Reputation: 1
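I think numHashTables is just the MinHash fingerprint (signature) size. It is probably an empirical parameter that depends on your use case. In the usual banding scheme, b * r = numHashTables, where b is the number of bands and r is the number of hash functions per band (currently r = 1).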
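To make the b * r relation a bit more concrete, here is a rough sketch of the textbook banding analysis for MinHash. It is not Spark's code: collisionProbability is a hypothetical helper, and it assumes (as stated above) that r = 1, so that b equals numHashTables. Under that assumption, two sets with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b.

```scala
// Hypothetical helper (not part of Spark): probability that two sets with
// Jaccard similarity s share at least one bucket, given b bands (combined
// with OR) of r MinHash functions each (combined with AND).
def collisionProbability(s: Double, b: Int, r: Int): Double =
  1.0 - math.pow(1.0 - math.pow(s, r), b)

// Assumption: r = 1, so every hash table is its own band and b = numHashTables.
val similarity = 0.5
val byTables: Seq[(Int, Double)] =
  Seq(1, 5, 20).map(n => n -> collisionProbability(similarity, b = n, r = 1))

// Print how the candidate-pair probability grows with numHashTables.
byTables.foreach { case (n, p) =>
  println(f"numHashTables = $n%2d -> P(candidate pair) = $p%.4f")
}
```

For s = 0.5, going from 1 to 5 hash tables raises the probability of catching the pair from 0.5 to about 0.97, which matches the intuition that more tables mean fewer missed pairs at the cost of more hashing and more candidates to verify.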
Upvotes: 0