torayeff

Reputation: 9702

Content-Seen check in Python

Hello! Let's say I have fetched a page with this script:

import urllib2
page = urllib2.urlopen(url).read()

While crawling web pages, how can I quickly check whether this content has already been crawled? My algorithm is like this:

    seenContents = set()
and then check whether the crawled content is already in the set.

But I do not know what to store in that set: a hash value or something else? Can you recommend something?

Upvotes: 2

Views: 163

Answers (1)

Maria Zverina

Reputation: 11163

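How about storing the MD5 digest of the content in the set? A digest is always 16 bytes, so the set stays small and lookups stay cheap however large the pages are.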

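import hashlib  # replaces the deprecated standalone md5 module

contents = b"some data"  # e.g. the bytes returned by urlopen(url).read()
m = hashlib.md5(contents)
m.digest()  # 16-byte digest to store in the set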
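To tie this back to the set from the question, here is a minimal sketch of a content-seen check. The names `seen_digests` and `is_new_content` are made up for illustration, and it assumes that "already crawled" means byte-for-byte identical content; pages that differ only in a timestamp or an ad will count as new.

```python
import hashlib

# Hypothetical name: digests of every page body crawled so far.
seen_digests = set()

def is_new_content(page):
    """Return True the first time this exact content is seen, False afterwards."""
    # hashlib works on bytes, so encode text input (assumption: UTF-8).
    if not isinstance(page, bytes):
        page = page.encode("utf-8")
    digest = hashlib.md5(page).digest()
    if digest in seen_digests:
        return False
    seen_digests.add(digest)
    return True
```

In the crawler you would call `is_new_content(page)` right after fetching and skip further processing when it returns False. MD5 is used here only as a fast fingerprint, not for security; if collisions are a concern, `hashlib.sha1` or `hashlib.sha256` can be swapped in the same way.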

Upvotes: 4
