How to choose the Elastiknn Jaccard LSH index parameters L and k? In my case the MinHash size is 100 and the target Jaccard similarity is 0.8

Question

I am trying to detect near-duplicates using the Elastiknn plugin.

I have created MinHash signatures of my text documents, with a MinHash size of 100.

I want to apply LSH with Jaccard similarity using the Elastiknn plugin, because it has this type of index available.

As far as I understand the MinHash LSH duplicate-detection algorithm, we have to choose the number of bands b and the number of rows per band r according to the required level of Jaccard similarity (say 0.8).

I am not sure if L and k are actually b and r.
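
For reference, the textbook b/r banding trade-off can be sketched in Python as below. This is only an illustration of standard MinHash banding, not Elastiknn's implementation. The function names are hypothetical, and the sketch assumes that b * r equals the signature length (100 here). Whether L and k play the roles of b and r is exactly the open question, so the code uses b and r only.

```python
def candidate_probability(s, b, r):
    """Probability that two sets with Jaccard similarity s agree on
    all r rows of at least one of the b bands (textbook banding)."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(b, r):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


def banding_options(num_hashes):
    """All (b, r) pairs with b * r == num_hashes (assumption: the whole
    signature is used, with no leftover hash values)."""
    return [(b, num_hashes // b)
            for b in range(1, num_hashes + 1)
            if num_hashes % b == 0]


def closest_to_threshold(num_hashes, target):
    """The (b, r) pair whose approximate threshold is nearest to target."""
    return min(banding_options(num_hashes),
               key=lambda br: abs(approx_threshold(*br) - target))


if __name__ == "__main__":
    # Print the trade-off for a signature of 100 hashes at similarity 0.8.
    for b, r in banding_options(100):
        print(b, r,
              round(approx_threshold(b, r), 3),
              round(candidate_probability(0.8, b, r), 4))
```

More bands with fewer rows each lower the threshold and catch more true pairs at the cost of more false candidates; fewer bands with more rows do the opposite. Picking the pair whose threshold sits slightly below the target similarity is a common way to favour recall.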

Can anybody explain how to tune Elastiknn's L and k to get maximum accuracy when retrieving documents at the required level of Jaccard similarity?

Answers (1)