Reputation: 355
I have some webpages where I'm collecting data over time. I don't care about the content itself, just whether the page has changed.
Currently, I use Python's requests.get to fetch a page, hash the page (md5), and store that hash value to compare in the future.
Is there a computationally cheaper or smaller-storage strategy for this? Things work now; I just wanted to check if there's a better/cheaper way. :)
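For reference, the approach described above can be sketched like this (a minimal sketch; the `page_changed` helper name is mine, and the fetch is shown with stdlib `hashlib` so the hashing part is self-contained):

```python
import hashlib

def content_hash(body: bytes) -> str:
    # Hash the raw response body; hexdigest is a 32-char string for MD5.
    return hashlib.md5(body).hexdigest()

def page_changed(body: bytes, stored_hash: str) -> bool:
    # Compare the fresh hash against the one saved last time.
    return content_hash(body) != stored_hash

# Typical usage with requests:
#   body = requests.get(url).content
#   if page_changed(body, stored_hash): ...
```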
Upvotes: 0
Views: 108
Reputation: 3637
A hash is the most reliable way to detect changes. I would use CRC32: its output is only 32 bits, versus 128 bits for MD5, and it's very fast even in browser JavaScript. I have personal experience improving the speed of a JS implementation of CRC32 for very large datasets.
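Since the question uses Python, CRC32 is already available in the standard library via `zlib.crc32` (no third-party dependency needed), and the checksum fits in a plain 32-bit integer rather than a 32-character hex string:

```python
import zlib

def crc_of(body: bytes) -> int:
    # zlib.crc32 returns an unsigned 32-bit integer (0 .. 2**32 - 1),
    # which is cheaper to store and compare than an MD5 hex digest.
    return zlib.crc32(body)
```

Note that CRC32 is a checksum, not a cryptographic hash, so accidental collisions are more likely than with MD5; for simple "did this page change?" tracking that trade-off is usually acceptable.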
Upvotes: 0
Reputation: 9162
You can keep track of the date of the last version you fetched and send it in the If-Modified-Since header of your request. However, some servers ignore that header (in general it's difficult to honor for dynamically generated content), in which case you'll have to fall back to a less efficient method.
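A sketch of this conditional-request approach, using stdlib `urllib` rather than `requests` so it stands alone (the `changed_since` helper name is mine; a server that honors the header replies 304 Not Modified instead of resending the body):

```python
import urllib.request
import urllib.error
from email.utils import formatdate

def http_date(ts: float) -> str:
    # RFC 7231 HTTP-date, e.g. "Thu, 01 Jan 1970 00:00:00 GMT"
    return formatdate(ts, usegmt=True)

def changed_since(url: str, last_fetch_ts: float) -> bool:
    req = urllib.request.Request(
        url, headers={"If-Modified-Since": http_date(last_fetch_ts)}
    )
    try:
        with urllib.request.urlopen(req):
            return True  # 200: server sent fresh content
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return False  # 304 Not Modified: nothing changed
        raise
```

A server that ignores the header will simply always return 200, so in practice you would still hash the body as a fallback when `changed_since` reports a change.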
Upvotes: 2