Reputation: 198547
I'm doing some web-crawling-type work where I look for certain terms in web pages, find their locations on the page, and then cache them for later use. I'd like to be able to check the page periodically for any major changes. Something like MD5 can be foiled by simply putting the current date and time on the page.
Are there any hashing algorithms that work for something like this?
Upvotes: 10
Views: 401
Reputation: 5848
http://www.phash.org/ did something like this for images. The gist: take an image, blur it, convert it to greyscale, do a discrete cosine transform, and look at just the upper-left quadrant of the result (where the important information is). Then record a 0 for each value less than the average and a 1 for each value greater than the average. The result is pretty robust to small changes.
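For illustration, here is a rough Python sketch of that perceptual-hash recipe (not phash.org's actual implementation); the 32x32 resize, the blur radius, and the 8x8 low-frequency corner are common choices I'm assuming, not something specified above:

```python
# Rough sketch of the perceptual-hash idea described above (not phash.org's
# actual code). Assumes Pillow, NumPy and SciPy; the 32x32 resize and the
# 8x8 low-frequency corner are illustrative choices.
import numpy as np
from PIL import Image, ImageFilter
from scipy.fftpack import dct

def perceptual_hash(path):
    # Blur, shrink, and convert to greyscale so only coarse structure remains.
    img = Image.open(path).filter(ImageFilter.GaussianBlur(2))
    img = img.convert("L").resize((32, 32), Image.LANCZOS)
    pixels = np.asarray(img, dtype=float)

    # 2-D DCT; the upper-left corner holds the low-frequency information.
    coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
    corner = coeffs[:8, :8]

    # 1 where a coefficient exceeds the average, 0 otherwise -> 64-bit hash.
    bits = (corner > corner.mean()).flatten()
    return sum(1 << i for i, b in enumerate(bits) if b)

def hamming(h1, h2):
    # A small Hamming distance between hashes means visually similar images.
    return bin(h1 ^ h2).count("1")
```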
Min-hashing is another possibility. Find features in your text and record each one as a value, then concatenate those values to make a hash string.
For both of the above, use a vantage-point tree so that you can search for near matches.
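Here's a minimal min-hash sketch in Python to show the idea; the use of word 3-shingles as the "features", the 64 simulated hash functions, and the MD5 seeding trick are all illustrative assumptions, not a prescribed design:

```python
# Minimal min-hash sketch: word 3-shingles as features, 64 simulated hash
# functions via seeded MD5. All parameters are arbitrary illustrative choices.
import hashlib

def shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64):
    feats = shingles(text)
    sig = []
    for seed in range(num_hashes):
        # Simulate independent hash functions by mixing a seed into each feature.
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in feats
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    # The fraction of matching minima approximates the Jaccard similarity
    # of the two feature sets, so it degrades gracefully with small edits.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```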
Upvotes: 1
Reputation: 29047
This might be a good place to use the Levenshtein distance metric, which quantifies the amount of editing required to transform one sequence into another.
The drawback of this approach is that you'd need to keep the full text of each page so that you could compare them later. With a hash-based approach, on the other hand, you simply store some sort of small computed value and don't require the previous full text for comparison.
You might also try some sort of hybrid approach: let a hashing algorithm tell you that any change has been made, and use that as a trigger to retrieve an archived copy of the document for a more rigorous (Levenshtein) comparison.
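A rough sketch of that hybrid idea, assuming MD5 as the cheap change trigger and a plain dynamic-programming Levenshtein distance; the 5% threshold is a made-up example value, not a recommendation:

```python
# Hybrid approach: a cheap hash flags *any* change, and only then do we pay
# for a full Levenshtein comparison against the archived copy.
import hashlib

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def major_change(old_text, new_text, threshold=0.05):
    if hashlib.md5(old_text.encode()).digest() == hashlib.md5(new_text.encode()).digest():
        return False  # byte-identical, nothing changed at all
    distance = levenshtein(old_text, new_text)
    # Treat it as "major" only if the edit distance is a large fraction of the page.
    return distance / max(len(old_text), len(new_text)) > threshold
```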
Upvotes: 3
Reputation: 6078
I am sorry to say, but hash algorithms are exact by design. There is none capable of being tolerant of minor differences. You should take another approach.
Upvotes: -4
Reputation: 133975
A common way to do document similarity is shingling, which is somewhat more involved than hashing. Also look into content defined chunking for a way to split up the document.
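A small sketch of what shingling plus a Jaccard comparison might look like; the 4-word shingles and the truncated SHA-1 are arbitrary illustrative choices:

```python
# Small sketch of w-shingling for document similarity, assuming 4-word
# shingles compared by plain Jaccard similarity.
import hashlib

def shingle_set(text, w=4):
    words = text.lower().split()
    grams = (" ".join(words[i:i + w]) for i in range(len(words) - w + 1))
    # Store a compact hash per shingle instead of the shingle text itself.
    return {hashlib.sha1(g.encode()).hexdigest()[:16] for g in grams}

def jaccard(a, b):
    # 1.0 means identical shingle sets; values near 0 mean very different pages.
    return len(a & b) / len(a | b) if (a or b) else 1.0

# A big drop in similarity between the cached page's shingle set and a fresh
# fetch signals a "major change" worth reprocessing.
```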
I read a paper a few years back about using Bloom filters for similarity detection, "Using Bloom Filters to Refine Web Search Results." It's an interesting idea, but I never got around to experimenting with it.
Upvotes: 11