Reputation: 27
When researching the SimHash algorithm for checking similarities between two documents, a few questions sprang up:
[...] SimHash?
Does SimHash only work on text documents? Can I hash binary data and expect it to work just as well (with the right feature vector representation)?
Upvotes: 0
Views: 240
Reputation: 416
The SimHash algorithm allows the computation of fingerprints (also called signatures) for sets of elements. These fingerprints can then be used to estimate the cosine similarity of the original sets. Thus, the SimHash algorithm is not limited to text documents. It can be used for any object that can be mapped to a set representation if the corresponding cosine similarity is a meaningful measure of object similarity.
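To make the fingerprint idea concrete, here is a minimal sketch of an unweighted SimHash over a set of byte strings. The 64-bit width, the use of BLAKE2b as the element hash, and the helper names (`simhash`, `estimated_cosine`) are assumptions chosen for illustration, not part of any particular library. Because the elements are plain bytes, the same code applies to text tokens and to binary data alike.

```python
import hashlib
import math

BITS = 64  # fingerprint width; an assumption for this sketch


def _hash64(element: bytes) -> int:
    """Stable 64-bit hash of one element (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(element, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(elements) -> int:
    """Unweighted SimHash fingerprint of a set of byte strings."""
    counts = [0] * BITS
    for element in set(elements):  # duplicates are ignored: set semantics
        h = _hash64(element)
        for i in range(BITS):
            # Each element votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(fp_a: int, fp_b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def estimated_cosine(fp_a: int, fp_b: int) -> float:
    """Rough cosine estimate: the fraction of differing bits approximates angle / pi."""
    return math.cos(math.pi * hamming(fp_a, fp_b) / BITS)
```

The estimate is noisy for a 64-bit fingerprint and for small sets; wider fingerprints or several independent fingerprints would tighten it.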
GPS routes, for example, could be represented as a set of cells in a rasterized map. The cosine similarity between sets of cells could be a measure of the similarity of different GPS routes.
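A rough sketch of that rasterization, assuming routes are given as lists of (latitude, longitude) pairs and that a fixed-size grid in degrees is good enough. The function names and the default `cell_size_deg` are hypothetical:

```python
import math


def route_to_cells(points, cell_size_deg=0.001):
    """Map (lat, lon) points to the set of grid cells they fall into."""
    return {
        (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        for lat, lon in points
    }


def set_cosine(a: set, b: set) -> float:
    """Exact cosine similarity of two sets: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Two short hypothetical routes on a coarse 0.5-degree grid.
route_a = [(0.1, 0.1), (0.6, 0.1), (1.1, 0.1)]
route_b = [(0.2, 0.2), (0.7, 0.2), (0.7, 0.7)]
similarity = set_cosine(route_to_cells(route_a, 0.5), route_to_cells(route_b, 0.5))
```

This exact set cosine is the quantity that SimHash fingerprints of the two cell sets would approximate. The cell size controls how tolerant the comparison is to small deviations between routes.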
A common method for mapping text documents to sets is tokenization, in which the text is decomposed into words or n-grams. Removing stop words that are likely to occur in each text document can increase the contrast of the cosine similarity.
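The tokenization step might look like the following sketch. The stop-word list is a tiny illustrative sample, and `byte_ngrams` is one possible (assumed, not prescribed) way to apply the same idea to binary data by sliding a window over the raw bytes:

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "to", "in"})


def word_set(text: str, stop_words=STOP_WORDS) -> set:
    """Lowercased words of a text, with stop words removed."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w not in stop_words}


def char_ngrams(text: str, n: int = 3) -> set:
    """All overlapping character n-grams of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """All overlapping byte n-grams; a possible set representation for binary data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}
```

Whether byte n-grams give a meaningful similarity depends on the binary format: compressed or encrypted data, for instance, can change almost everywhere after a small edit.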
Upvotes: 0