Reputation: 6093
I am currently working on document clustering using MinHashing
technique. However, I am not getting desired results as MinHash is a rough estimation of Jaccard similarity
and it doesn't suits my requirement.
This is my scenario:
I have a huge set of books and if a single page is given as a query, I need to find the corresponding book from which this page is obtained from. The limitation is, I have features for the entire book and it's impossible to get page-by-page features for the books. In this case, Jaccard similarity is giving poor results if the book is too big. What I really want is the distance between query page and the books (not vice-versa). That is:
Given 2 sets A, B: I want the distance from A to B,
dis(A->B) = (A & B)/A
Is there similar distance metric that gives distance from set A to set B. Further, is it still possible to use MinHashing
algorithm with this kind of similarity metric?
Upvotes: 0
Views: 257
Reputation: 13076
We can estimate your proposed distance function using a similar approach as the MinHash algorithm.
For some hash function h(x)
, compute the minimal values of h
over A
and B
. Denote these values h_min(A)
and h_min(B)
. The MinHash algorithm relies on the fact that the probability that h_min(A) = h_min(B)
is (A & B) / (A | B)
. We may observe that the probability that h_min(A) <= h_min(B)
is A / (A | B)
. We can then compute (A & B) / A
as the ratio of these two probabilities.
Like in the regular MinHash algorithm, we can approximate these probabilities by repeated sampling until the desired variance is achieved.
Upvotes: 1