Reputation: 1
I am trying to detect near-duplicates using the Elastiknn plugin.
I have created MinHashes of my text documents, with MinHash set size = 100.
I want to apply LSH with Jaccard similarity using Elastiknn, because it has this type of index available.
As I understand the MinHash LSH duplicate-detection algorithm, for a required level of Jaccard similarity (say 0.8) we have to choose the
Elastiknn exposes different parameters, L and k: https://elastiknn.com/api/#jaccard-lsh-mapping
I am not sure if L and k are actually b and r.
Can anybody explain how to tune L and k from Elastiknn to get maximum accuracy for required level of jaccard similar documents?
Upvotes: 0
Views: 717
Reputation: 1323
I am not sure if L and k are actually b and r.
Can you provide a more precise definition of b and r? For example "size" is ambiguous, and "number of buckets" might mean the same thing as "number of hash tables", but maybe not? I looked briefly and don't see any references to b and r in the context of minhash.
Can anybody explain how to tune L and k from Elastiknn to get maximum accuracy for required level of jaccard similar documents?
Parameter tuning is generally a process of trial-and-error. The general guidelines are as described in the docs:
This pattern of OR and AND amplification applies to all of the LSH algos used in Elastiknn. LSH and Amplification are covered more thoroughly here: https://elastiknn.com/posts/tour-de-elastiknn-august-2021/
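To make the OR/AND amplification concrete, here is a minimal Python sketch of the textbook MinHash LSH collision formula. It assumes that Elastiknn's L plays the role of the number of hash tables (the OR step, often called b or "bands") and k the number of hash functions concatenated per table (the AND step, often called r or "rows"); check that mapping against the Elastiknn docs before relying on it. The function names are made up for illustration and are not part of Elastiknn, and the budget of 100 hashes is simply the MinHash size from the question, not an Elastiknn constraint.

```python
from typing import List, Tuple


def candidate_probability(s: float, L: int, k: int) -> float:
    """Chance that two sets with Jaccard similarity s collide in at least one table.

    AND step: all k MinHash values in one table agree with probability s**k.
    OR step: at least one of L independent tables agrees.
    """
    return 1.0 - (1.0 - s ** k) ** L


def approx_threshold(L: int, k: int) -> float:
    """Commonly used approximation of the similarity threshold: (1/L)**(1/k)."""
    return (1.0 / L) ** (1.0 / k)


def factor_pairs(total_hashes: int) -> List[Tuple[int, int]]:
    """All (L, k) with L * k == total_hashes, i.e. ways to split a fixed hash budget."""
    return [(L, total_hashes // L) for L in range(1, total_hashes + 1)
            if total_hashes % L == 0]


if __name__ == "__main__":
    target = 0.8  # required Jaccard similarity from the question
    for L, k in factor_pairs(100):  # assumption: 100 hashes in total, as in the question
        print(f"L={L:3d} k={k:3d} "
              f"threshold~{approx_threshold(L, k):.3f} "
              f"P(candidate | s={target})={candidate_probability(target, L, k):.4f} "
              f"P(candidate | s=0.5)={candidate_probability(0.5, L, k):.4f}")
```

Under these assumptions, raising L pushes the curve up (more candidates, higher recall) and raising k pushes it down (fewer false positives, higher precision). A reasonable starting point is a pair whose approximate threshold sits slightly below the target similarity, followed by trial-and-error on real data.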
Upvotes: 1