Which is the best way to use Scrapy to crawl 1000 sites?

Question

I'd like to hear the differences between three approaches to using Scrapy to crawl 1000 sites.

For example, I want to scrape 1000 photo sites that almost all share the same structure: one kind of photo list page and another kind of full-size photo page. The HTML of these list and photo pages, however, is not identical across sites.

As another example, I want to scrape 1000 WordPress blogs, but only the blogs' articles.

The first is crawling all 1000 sites using one Scrapy project.
The second is having all 1000 sites under the same Scrapy project, with all items in items.py and each site having its own spider.
The third is similar to the second, but with one spider for all the sites instead of separate spiders.

What are the differences, and which do you think is the right approach? Is there any other, better approach I've missed?

Brainhash · Accepted Answer

I had 90 sites to pull from, so creating one crawler per site wasn't a great option. The idea was to be able to run in parallel. I also split the work so that similar page formats were handled in one place.

So I ended up with 2 crawlers:

Crawler 1 - URL Extractor. This extracts all detail page URLs from the top-level listing pages and writes them to one or more files.
Crawler 2 - Fetch Details. This reads from the URL file(s) and extracts item details.

This allowed me to fetch the URLs first and estimate the number of threads I might need for the second crawler.
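
The answer does not say how that estimate was made. One possible sketch, assuming a known per-worker throughput and a target run time (both hypothetical inputs), is to size the worker pool from the URL count and then split the URL list evenly. Scrapy itself is asynchronous rather than threaded, so a "worker" here could just as well be a separate crawl process.

```python
import math


def estimate_workers(url_count, urls_per_worker_per_hour, target_hours, max_workers=16):
    """Return how many parallel workers are needed to finish within target_hours.

    All three rate and limit parameters are assumptions supplied by the caller.
    """
    if url_count == 0:
        return 0
    needed = math.ceil(url_count / (urls_per_worker_per_hour * target_hours))
    # Always use at least one worker, and never exceed the configured cap.
    return max(1, min(needed, max_workers))


def split_round_robin(urls, workers):
    """Split urls into `workers` lists of near-equal size."""
    return [urls[i::workers] for i in range(workers)]


if __name__ == "__main__":
    # Hypothetical URL list standing in for the file written by the first crawler.
    urls = [f"https://example.com/p/{n}" for n in range(10_000)]
    workers = estimate_workers(len(urls), urls_per_worker_per_hour=1_000, target_hours=2)
    chunks = split_round_robin(urls, workers)
    print(workers, [len(chunk) for chunk in chunks])
```

Each chunk could then be written to its own URL file and handed to one run of the second crawler.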

Since each crawler worked on a specific page format, there were quite a few functions I could reuse.
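
One way to get that kind of reuse, sketched here under assumptions, is to keep a table of CSS selectors per page format and map each site to a format, so a single extraction function serves every site that shares a layout. The format names, host names, and selectors below are hypothetical; `response` is expected to behave like a Scrapy response, that is, to offer `.url` and `.css(query).get()`.

```python
from urllib.parse import urlparse

# Hypothetical page formats: field name -> CSS selector.
FORMATS = {
    "gallery_a": {"title": "h1::text", "image": "img.full::attr(src)"},
    "gallery_b": {"title": "h2.caption::text", "image": "div.photo img::attr(src)"},
}

# Hypothetical mapping from site host to the format it uses.
SITE_FORMAT = {
    "photos.example.com": "gallery_a",
    "pics.example.org": "gallery_b",
}


def selectors_for(url):
    """Look up the selector table for the site that serves `url`.

    Raises KeyError for a host that has not been assigned a format.
    """
    host = urlparse(url).netloc
    return FORMATS[SITE_FORMAT[host]]


def extract_item(response, selectors):
    """Shared extraction routine: apply every selector and build one item."""
    item = {"url": response.url}
    for field, query in selectors.items():
        item[field] = response.css(query).get()
    return item
```

Inside a spider callback this could be used as `yield extract_item(response, selectors_for(response.url))`, so supporting a new site with a known layout only means adding one entry to `SITE_FORMAT`.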
