Reputation: 355
I have some webpages where I'm collecting data over time. I don't care about the content itself, just whether the page has changed.
Currently, I use Python's requests.get to fetch a page, hash the page (md5), and store that hash value to compare in the future.
Is there a computationally cheaper or smaller-storage strategy for this? Things work now; I just wanted to check if there's a better/cheaper way. :)
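For reference, the approach described above can be sketched like this (a minimal sketch; the `page_changed` helper name is mine, and the fetch is shown with stdlib `hashlib` so the hashing part is self-contained):

```python
import hashlib

def content_hash(body: bytes) -> str:
    # Hash the raw response body; hexdigest is a 32-char string for MD5.
    return hashlib.md5(body).hexdigest()

def page_changed(body: bytes, stored_hash: str) -> bool:
    # Compare the fresh hash against the one saved last time.
    return content_hash(body) != stored_hash

# Typical usage with requests:
#   body = requests.get(url).content
#   if page_changed(body, stored_hash): ...
```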
Upvotes: 0
Views: 108
Reputation: 3637
A hash is the most reliable way to detect changes. I would use CRC32: its output is only 32 bits, versus 128 bits for MD5, and it's very fast even in browser JavaScript. I have personal experience improving the speed of a JS implementation of CRC32 for very large datasets.
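Since the question uses Python, CRC32 is already available in the standard library via `zlib.crc32` (no third-party dependency needed), and the checksum fits in a plain 32-bit integer rather than a 32-character hex string:

```python
import zlib

def crc_of(body: bytes) -> int:
    # zlib.crc32 returns an unsigned 32-bit integer (0 .. 2**32 - 1),
    # which is cheaper to store and compare than an MD5 hex digest.
    return zlib.crc32(body)
```

Note that CRC32 is a checksum, not a cryptographic hash, so accidental collisions are more likely than with MD5; for simple "did this page change?" tracking that trade-off is usually acceptable.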
Upvotes: 0
Reputation: 9162
You can keep track of the date of the last version you fetched and send it in the If-Modified-Since header of your request. However, some servers ignore that header (in general it's difficult to honor for dynamically generated content), in which case you'll have to fall back to a less efficient method.
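A sketch of this conditional-request approach, using stdlib `urllib` rather than `requests` so it stands alone (the `changed_since` helper name is mine; a server that honors the header replies 304 Not Modified instead of resending the body):

```python
import urllib.request
import urllib.error
from email.utils import formatdate

def http_date(ts: float) -> str:
    # RFC 7231 HTTP-date, e.g. "Thu, 01 Jan 1970 00:00:00 GMT"
    return formatdate(ts, usegmt=True)

def changed_since(url: str, last_fetch_ts: float) -> bool:
    req = urllib.request.Request(
        url, headers={"If-Modified-Since": http_date(last_fetch_ts)}
    )
    try:
        with urllib.request.urlopen(req):
            return True  # 200: server sent fresh content
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return False  # 304 Not Modified: nothing changed
        raise
```

A server that ignores the header will simply always return 200, so in practice you would still hash the body as a fallback when `changed_since` reports a change.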
Upvotes: 2