Reputation: 69
I was building a crawler for a search engine last year, and we ran into the problem of handling page recency: pages change over time, and we need to keep track of those changes and re-crawl pages when we know their contents have changed.
So, we asked our professor for a solution to this problem, and he told us to look at the sitemaps of these pages. We found that not all sites provide sitemaps that could help us with this. When we told him that, he suggested a somewhat odd solution: re-crawl everything after a random interval of time.
That said, I've tried looking into the problem myself and haven't found anything that helps. So, as a minimal baseline, I store a hash of every page I've crawled; when re-crawling after that random interval, I hash the current page and compare it against the last saved hash. If they differ, I re-process the page.
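For context, here is roughly what that looks like (a minimal sketch in Python, assuming the `requests` library and an in-memory dict standing in for our actual storage):

```python
import hashlib
import requests

# url -> SHA-256 hex digest of the last crawled content
# (a stand-in for a real database table)
page_hashes = {}

def has_changed(url):
    """Fetch the page, hash it, and compare against the last saved hash."""
    body = requests.get(url, timeout=10).content
    current_hash = hashlib.sha256(body).hexdigest()
    if page_hashes.get(url) != current_hash:
        page_hashes[url] = current_hash
        return True   # content differs from the last crawl -> re-process
    return False      # unchanged -> skip
```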
I want to know if there's a more efficient way to keep track of page recency.
Upvotes: 0
Views: 42
Reputation: 36319
Well, it depends on whether the pages follow conventions or not. Most major websites use cache-control headers (or Last-Modified and ETag). Those should tell you when a page changes, if the site you're crawling uses them. So I think the most broadly applicable and efficient approach is to check for those headers and use them when they exist. If they don't, you can fall back to your page-hash approach, though even hashing a page might not work as expected if (for example) the site dynamically renders minor changes on the server (e.g. the current date/time, rendering time, etc.).
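For illustration, a minimal sketch of a conditional re-fetch using those validators (assuming Python's `requests` library and a hypothetical in-memory store of the last-seen validators per URL):

```python
import requests

# url -> {"etag": ..., "last_modified": ...} (hypothetical validator store)
validators = {}

def fetch_if_changed(url):
    """Issue a conditional GET; return the body only if the page changed."""
    headers = {}
    saved = validators.get(url, {})
    if saved.get("etag"):
        headers["If-None-Match"] = saved["etag"]
    if saved.get("last_modified"):
        headers["If-Modified-Since"] = saved["last_modified"]

    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code == 304:
        return None  # 304 Not Modified: the server says nothing changed

    # Remember the new validators for the next crawl cycle
    validators[url] = {
        "etag": response.headers.get("ETag"),
        "last_modified": response.headers.get("Last-Modified"),
    }
    return response.text
```

When the server honors these headers, a 304 response carries no body, so you also save bandwidth compared to downloading and hashing every page.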
Upvotes: 2