theitpushover
theitpushover

Reputation: 23

How to determine upper bound of c when estimating jaccard similarity between documents?

Let's say I've a million documents that I preprocessed (calculated signatures for using minhash) in O(D*sqrt(D)) time where D is the number of documents. When I'm given a query document, I've to return the first of the million preprocessed documents in O(sqrt(D)) time such that the jaccard similarity is greater than or equal to, say, 0.8.

If there's no document similar to the query document enough to reach that score, I've to return a document with similarity at least c * 0.8 (where c<1) with probability at least 1 - 1/e^2. How may I find the maximum value of C for this minhash scheme?

Upvotes: 0

Views: 186

Answers (1)

Ben Whitmore
Ben Whitmore

Reputation: 967

Your orders of complexity/time don't sound right. Calculating the minhashes (signature) for a document should be roughly O(n), where n is the number of features (e.g., words, or shingles).

Finding all similar documents to a given document (with estimated similarity above a given threshold) should be roughly O(log(n)), where n is the number of candidate documents.

A document with (estimated) minimum .8 jaccard similarity will have at least 80% of its minhashes matching the given document. You haven't defined c and e for us, so I can't tell what your minimum threshold is -- I'll leave that to you -- but you can easily achieve this efficiently in a single pass:

Work through all your base document's hashes one by one. For each hash, look in your hash dictionary for all other docs that share that hash. Keep a tally for each document found of how many hashes it shares. As soon as one of these tallies reaches 80% of the total number of hashes, you have found the winning document and can halt calculations. But if none of the tallies ever reach the .8 threshold, then continue to the end. Then you can choose the document with the highest tally and decide whether that passes your minimum threshold.

Upvotes: 0

Related Questions