binarysolo

Reputation: 355

Efficient way to check whether a page changed (while storing as little info as possible)?

I have some webpages where I'm collecting data over time. I don't care about the content itself, just whether the page has changed.

Currently, I use Python's requests.get to fetch a page, hash the page (md5), and store that hash value to compare in the future.
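For reference, the current approach might look roughly like the sketch below. The helper names (`fetch_body`, `fingerprint`, `has_changed`) and the 10-second timeout are illustrative assumptions, not part of the original setup; only `requests.get` and md5 come from the description above.

```python
import hashlib
from typing import Optional


def fetch_body(url: str) -> bytes:
    """Fetch the raw response body for a URL."""
    # requests is a third-party package; it is imported lazily here so the
    # hashing helpers below can be used without it installed.
    import requests

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.content


def fingerprint(body: bytes) -> str:
    """Return the md5 digest of the page body as a 32-character hex string."""
    return hashlib.md5(body).hexdigest()


def has_changed(body: bytes, stored: Optional[str]) -> bool:
    """True if there is no stored fingerprint yet or the digest differs."""
    return stored is None or fingerprint(body) != stored
```

Only the hex digest (or the 16 raw bytes from `.digest()`) needs to be kept between runs.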

Is there a computationally cheaper or smaller-storage strategy for this? Things work now; I just wanted to check if there's a better/cheaper way. :)

Upvotes: 0

Views: 108

Answers (2)

ravenac95

Reputation: 3637

A hash is the most reliable way to detect a change. I would use CRC32: it's only 32 bits, as opposed to 128 bits for md5. It can also be very fast, even in browser JavaScript. I have personal experience improving the speed of a JS implementation of CRC32 for very large datasets.
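Since the question uses Python, here is a minimal sketch of the same idea with the standard library's `zlib.crc32`. The function names `crc32_fingerprint` and `crc32_bytes` are made up for illustration.

```python
import zlib


def crc32_fingerprint(body: bytes) -> int:
    """Return the CRC32 checksum of the page body as an unsigned 32-bit int."""
    # In Python 3, zlib.crc32 always returns a value in the range 0..2**32-1.
    return zlib.crc32(body)


def crc32_bytes(body: bytes) -> bytes:
    """Pack the checksum into exactly 4 bytes for compact storage."""
    return crc32_fingerprint(body).to_bytes(4, "big")


checksum = crc32_fingerprint(b"The quick brown fox jumps over the lazy dog")
print(hex(checksum))  # 0x414fa339
```

Comparing the stored 4-byte value against the checksum of a fresh fetch works the same way as comparing md5 digests.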

Upvotes: 0

morningstar

Reputation: 9162

You can keep track of the date of the last version you got and send it in the If-Modified-Since header of your request. However, some resources ignore that header. (In general, it's difficult for servers to handle it for dynamically generated content.) In that case you'll have to fall back to a less efficient method.
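A rough sketch of such a conditional request, assuming the `requests` package from the question. The helper names and the `None` return value meaning "fall back to hashing" are illustrative choices, not an established API.

```python
from email.utils import formatdate
from typing import Optional


def conditional_headers(last_fetch_ts: float) -> dict:
    """Build an If-Modified-Since header from a Unix timestamp."""
    # usegmt=True produces the HTTP date format, e.g.
    # "Thu, 01 Jan 1970 00:00:00 GMT".
    return {"If-Modified-Since": formatdate(last_fetch_ts, usegmt=True)}


def interpret_status(status_code: int) -> Optional[bool]:
    """Return False if the server reports no change, else None (unknown)."""
    if status_code == 304:  # 304 Not Modified: no body is sent
        return False
    # Any other status: the page may have changed, or the server may simply
    # ignore the header, so the caller should fall back to comparing hashes.
    return None


def changed_since(url: str, last_fetch_ts: float) -> Optional[bool]:
    """Issue a conditional GET and interpret the response status."""
    import requests  # third-party; imported lazily

    response = requests.get(
        url, headers=conditional_headers(last_fetch_ts), timeout=10
    )
    return interpret_status(response.status_code)
```

When the server honors the header, a 304 response carries no body, which saves both bandwidth and the hashing step.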

Upvotes: 2
