Reputation: 9702
Hello Guys! Let's say I have some page which I got with this script:
page = urllib2.urlopen(url).read()
While crawling web pages, how can I efficiently (quickly) check whether a page's content has already been crawled? My algorithm looks like this:
seenContents = set()
then check whether the crawled content is already in the set.
But I do not know what to store in that set: a hash value, or something else? Can you recommend something?
Upvotes: 2
Views: 163
Reputation: 11163
How about MD5 of the content?
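import hashlib  # replaces the md5 module, deprecated since Python 2.5
contents = b"some data"  # e.g. the bytes returned by urlopen(url).read()
m = hashlib.md5(contents)
m.digest()  # 16-byte digest, small enough to store in a set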
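Putting the digest together with the set from the question might look roughly like the sketch below. It assumes hashlib is available and that page bodies are byte strings; the names content_key, mark_if_new and seen_contents are made up for illustration, and the sample pages stand in for real downloads.

```python
import hashlib


def content_key(content):
    """Return the 16-byte MD5 digest of a page body given as bytes."""
    return hashlib.md5(content).digest()


def mark_if_new(content, seen):
    """Return True and record the digest if this content was not seen before."""
    key = content_key(content)
    if key in seen:
        return False
    seen.add(key)
    return True


# Hypothetical page bodies standing in for urlopen(url).read() results.
seen_contents = set()
pages = [b"<html>first</html>", b"<html>second</html>", b"<html>first</html>"]
results = [mark_if_new(page, seen_contents) for page in pages]
print(results)  # [True, True, False]
```

Only the digests are kept in the set, so memory use stays at 16 bytes per page plus set overhead, and the membership test is a constant-time lookup on average. Note that this only catches byte-identical pages; a page that differs by a single character (a timestamp, for instance) gets a different digest.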
Upvotes: 4