user1068464

Reputation: 325

How to go about incremental scraping of large sites in near-realtime

I want to scrape a lot of sites (a few hundred), which are basically bulletin boards. Some of these are very large (up to 1.5 million records) and also growing very quickly. What I want to achieve is to have all the records from each site and to keep them up to date in near-realtime, fetching only the new records on each run.

For this we are using Scrapy and save the items in a PostgreSQL database. The problem right now is: how can I make sure I get all the records without scraping the complete site every time? (Which would not only be very aggressive traffic-wise, but also not possible to complete within one hour.)

For example: I have a site with 100 pages and 10 records each. So I scrape page 1 and then go to page 2. But on fast-growing sites, by the time I request page 2, there might be 10 new records, so I would get the same items again. Nevertheless, I would get all items in the end. BUT the next time I scrape this site, how do I know where to stop? I can't stop at the first record that is already in my database, because that record might suddenly be back on the first page, because a new reply was made to it.

I am not sure if I got my point across, but tl;dr: how do I fetch fast-growing bulletin boards incrementally, so that I get all the records but only fetch the new ones each time? I looked at Scrapy's resume feature and also at Scrapinghub's deltafetch middleware, but I don't know if (and how) they can help overcome this problem.
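For reference, from what I can tell the deltafetch middleware is enabled with a couple of settings in settings.py (this assumes the scrapy-deltafetch package; it skips requests for pages that already produced items on earlier runs):

```python
# settings.py -- minimal scrapy-deltafetch setup (assumed from its docs)
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
```

But it is not clear to me whether skipping already-seen pages is enough when records keep shifting between listing pages.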

Upvotes: 1

Views: 528

Answers (1)

Alexis

Reputation: 64

For example: I have a site with 100 pages and 10 records each. So I scrape page 1 and then go to page 2. But on fast-growing sites, by the time I request page 2, there might be 10 new records, so I would get the same items again. Nevertheless, I would get all items in the end. BUT the next time I scrape this site, how do I know where to stop? I can't stop at the first record that is already in my database, because that record might suddenly be back on the first page, because a new reply was made to it.

Usually each record has a unique link (permalink), e.g. the question above can be accessed by just entering https://stackoverflow.com/questions/39805237/ and ignoring the text beyond that. You'll have to store the unique URL for each record, and when you scrape next time, ignore the ones that you already have.
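Since the items are already being saved to PostgreSQL, that check fits naturally into a Scrapy item pipeline. Here is a minimal sketch, assuming the permalink lives in the item's "url" field and a hypothetical seen_records table (the table name, connection string and field name are illustrative, not from the original post):

```python
# pipelines.py -- drop items whose permalink was already stored (sketch)
import psycopg2
from scrapy.exceptions import DropItem


class SeenUrlPipeline:
    def open_spider(self, spider):
        # Hypothetical connection string; replace with real credentials.
        self.conn = psycopg2.connect("dbname=boards user=scraper")
        self.cur = self.conn.cursor()
        # One row per record already scraped, keyed by permalink.
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS seen_records (url TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert the permalink; if it already exists, this record was
        # scraped on a previous run, so drop it.
        self.cur.execute(
            "INSERT INTO seen_records (url) VALUES (%s) ON CONFLICT (url) DO NOTHING",
            (item["url"],),
        )
        self.conn.commit()
        if self.cur.rowcount == 0:
            raise DropItem("already seen: %s" % item["url"])
        return item
```

Enable it with ITEM_PIPELINES = {"myproject.pipelines.SeenUrlPipeline": 300} (the project path is illustrative).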

If you take the example of the python tag on Stack Overflow, you can view the questions here: https://stackoverflow.com/questions/tagged/python, but the sort order alone can't be relied upon to avoid duplicates. One way to scrape would be to sort by newest questions and keep ignoring duplicate ones by their URL.

You can have an algorithm that scrapes the first 'n' pages every 'x' minutes until it hits an existing record. The whole flow is a bit site-specific, but as you scrape more sites, your algorithm will become more generic and robust to handle edge cases and new sites.
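As a rough illustration of that stopping rule, here is a sketch of a spider that walks a newest-first listing and stops paginating once it has seen a run of already-known records. The URL, CSS selectors and the in-memory seen_urls set are placeholders; a real spider would preload the permalinks from the database used by the pipeline above:

```python
# spider sketch: follow "newest" listing pages until only known records appear
import scrapy


class BoardSpider(scrapy.Spider):
    name = "board"
    # Hypothetical listing URL, sorted newest first.
    start_urls = ["https://example-board.invalid/latest?page=1"]

    # Tolerate a few already-known records before stopping, to cover items
    # that shifted between pages while the crawl was running.
    known_streak_limit = 20

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_urls = set()   # in practice: preload permalinks from PostgreSQL
        self.known_streak = 0

    def parse(self, response):
        for row in response.css("div.record"):             # hypothetical selector
            url = response.urljoin(row.css("a::attr(href)").get())
            if url in self.seen_urls:
                self.known_streak += 1
                continue
            self.known_streak = 0
            self.seen_urls.add(url)
            yield {"url": url, "title": row.css("a::text").get()}

        # Only follow the next page while new records are still turning up.
        next_page = response.css("a.next::attr(href)").get()  # hypothetical selector
        if next_page and self.known_streak < self.known_streak_limit:
            yield response.follow(next_page, callback=self.parse)
```

Using a streak limit instead of a hard "stop at the first known record" is what deals with the shifting-pages problem described in the question.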

Another approach is to not run Scrapy yourself, but to use a distributed spider/crawling service. These generally have multiple IPs and can spider large sites within minutes. Just make sure you respect the site's robots.txt file and don't accidentally DDoS them.
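Whichever way the crawl runs, Scrapy's built-in settings can enforce that politeness; an illustrative settings.py fragment (the values are assumptions to be tuned per site):

```python
# settings.py -- crawl politely (illustrative values)
ROBOTSTXT_OBEY = True                 # honour each site's robots.txt
DOWNLOAD_DELAY = 1.0                  # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep per-site parallelism low
AUTOTHROTTLE_ENABLED = True           # back off automatically when a site slows down
```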

Upvotes: 1
