user1068464

Reputation: 325

How to go about incremental scraping of large sites in near-realtime

I want to scrape a lot of sites (a few hundred), which are basically bulletin boards. Some of these are very large (up to 1.5 million records) and also growing very quickly. What I want to achieve is to have all the records from each site and to keep them up to date in near-realtime, fetching only the new records on each run.

For this we are using Scrapy and save the items in a PostgreSQL database. The problem right now is: how can I make sure I get all the records without scraping the complete site every time? (Which would not only be very aggressive traffic-wise, but also not possible to complete within one hour.)

For example: I have a site with 100 pages and 10 records each. So I scrape page 1 and then go to page 2. But on fast-growing sites, by the time I request page 2, there might be 10 new records, so I would get the same items again. Nevertheless, I would get all items in the end. BUT the next time I scrape this site, how do I know where to stop? I can't stop at the first record that is already in my database, because that record might suddenly be back on the first page, because a new reply was made to it.

I am not sure if I got my point across, but tl;dr: how do I fetch fast-growing bulletin boards incrementally, so that I get all the records but only fetch the new ones each time? I looked at Scrapy's resume feature and also at Scrapinghub's deltafetch middleware, but I don't know if (and how) they can help overcome this problem.
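For reference, from what I can tell the deltafetch middleware is enabled with a couple of settings in settings.py (this assumes the scrapy-deltafetch package; it skips requests for pages that already produced items on earlier runs):

```python
# settings.py -- minimal scrapy-deltafetch setup (assumed from its docs)
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
```

But it is not clear to me whether skipping already-seen pages is enough when records keep shifting between listing pages.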

Upvotes: 1

Views: 528

Answers (1)

Alexis

Reputation: 64

For example: I have a site with 100 pages and 10 records each. So I scrape page 1 and then go to page 2. But on fast-growing sites, by the time I request page 2, there might be 10 new records, so I would get the same items again. Nevertheless, I would get all items in the end. BUT the next time I scrape this site, how do I know where to stop? I can't stop at the first record that is already in my database, because that record might suddenly be back on the first page, because a new reply was made to it.

Usually each record has a unique link (permalink), e.g. the question above can be accessed by just entering https://stackoverflow.com/questions/39805237/ and ignoring the text beyond that. You'll have to store the unique URL for each record, and when you scrape next time, ignore the ones that you already have.
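Since the items are already being saved to PostgreSQL, that check fits naturally into a Scrapy item pipeline. Here is a minimal sketch, assuming the permalink lives in the item's "url" field and a hypothetical seen_records table (the table name, connection string and field name are illustrative, not from the original post):

```python
# pipelines.py -- drop items whose permalink was already stored (sketch)
import psycopg2
from scrapy.exceptions import DropItem


class SeenUrlPipeline:
    def open_spider(self, spider):
        # Hypothetical connection string; replace with real credentials.
        self.conn = psycopg2.connect("dbname=boards user=scraper")
        self.cur = self.conn.cursor()
        # One row per record already scraped, keyed by permalink.
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS seen_records (url TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert the permalink; if it already exists, this record was
        # scraped on a previous run, so drop it.
        self.cur.execute(
            "INSERT INTO seen_records (url) VALUES (%s) ON CONFLICT (url) DO NOTHING",
            (item["url"],),
        )
        self.conn.commit()
        if self.cur.rowcount == 0:
            raise DropItem("already seen: %s" % item["url"])
        return item
```

Enable it with ITEM_PIPELINES = {"myproject.pipelines.SeenUrlPipeline": 300} (the project path is illustrative).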

If you take the example of the python tag on Stack Overflow, you can view the questions here: https://stackoverflow.com/questions/tagged/python, but the sort order alone can't be relied upon to avoid duplicates. One way to scrape would be to sort by newest questions and keep ignoring duplicate ones by their URL.

You can have an algorithm that scrapes the first 'n' pages every 'x' minutes until it hits an existing record. The whole flow is a bit site-specific, but as you scrape more sites, your algorithm will become more generic and robust to handle edge cases and new sites.
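As a rough illustration of that stopping rule, here is a sketch of a spider that walks a newest-first listing and stops paginating once it has seen a run of already-known records. The URL, CSS selectors and the in-memory seen_urls set are placeholders; a real spider would preload the permalinks from the database used by the pipeline above:

```python
# spider sketch: follow "newest" listing pages until only known records appear
import scrapy


class BoardSpider(scrapy.Spider):
    name = "board"
    # Hypothetical listing URL, sorted newest first.
    start_urls = ["https://example-board.invalid/latest?page=1"]

    # Tolerate a few already-known records before stopping, to cover items
    # that shifted between pages while the crawl was running.
    known_streak_limit = 20

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_urls = set()   # in practice: preload permalinks from PostgreSQL
        self.known_streak = 0

    def parse(self, response):
        for row in response.css("div.record"):             # hypothetical selector
            url = response.urljoin(row.css("a::attr(href)").get())
            if url in self.seen_urls:
                self.known_streak += 1
                continue
            self.known_streak = 0
            self.seen_urls.add(url)
            yield {"url": url, "title": row.css("a::text").get()}

        # Only follow the next page while new records are still turning up.
        next_page = response.css("a.next::attr(href)").get()  # hypothetical selector
        if next_page and self.known_streak < self.known_streak_limit:
            yield response.follow(next_page, callback=self.parse)
```

Using a streak limit instead of a hard "stop at the first known record" is what deals with the shifting-pages problem described in the question.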

Another approach is to not run Scrapy yourself, but to use a distributed spider/crawling service. These generally have multiple IPs and can spider large sites within minutes. Just make sure you respect the site's robots.txt file and don't accidentally DDoS them.
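Whichever way the crawl runs, Scrapy's built-in settings can enforce that politeness; an illustrative settings.py fragment (the values are assumptions to be tuned per site):

```python
# settings.py -- crawl politely (illustrative values)
ROBOTSTXT_OBEY = True                 # honour each site's robots.txt
DOWNLOAD_DELAY = 1.0                  # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep per-site parallelism low
AUTOTHROTTLE_ENABLED = True           # back off automatically when a site slows down
```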

Upvotes: 1
