Set distance as similarity metric for MinHashing algorithm

Question

I am currently working on document clustering using MinHashing technique. However, I am not getting desired results as MinHash is a rough estimation of Jaccard similarity and it doesn't suits my requirement.

This is my scenario:

I have a huge set of books and if a single page is given as a query, I need to find the corresponding book from which this page is obtained from. The limitation is, I have features for the entire book and it's impossible to get page-by-page features for the books. In this case, Jaccard similarity is giving poor results if the book is too big. What I really want is the distance between query page and the books (not vice-versa). That is:

Given 2 sets A, B: I want the distance from A to B,

dis(A->B) =  (A & B)/A

Is there similar distance metric that gives distance from set A to set B. Further, is it still possible to use MinHashing algorithm with this kind of similarity metric?

Set distance as similarity metric for MinHashing algorithm

Answers (1)

Related Questions