jianbing Ma

Reputation: 375

Which is the better way to use Scrapy to crawl 1000 sites?

I'd like to hear the differences between 3 different approaches for using Scrapy to crawl 1000 sites.

For example, I want to scrape 1000 photo sites that almost all have the same structure: one kind of photo list page and another kind of large photo page. However, the HTML of these list and photo description pages will not be identical across sites.

Another example: I want to scrape 1000 WordPress blogs, but only the blogs' articles.

What are the differences, and which do you think is the right approach? Is there any other, better approach I've missed?

Upvotes: 0

Views: 342

Answers (1)

Brainhash

Reputation: 141

I had 90 sites to pull from, so creating one crawler per site wasn't a great option. The idea was to be able to run in parallel. I also split the work so that similar page formats were handled in one place.

So I ended up with 2 crawlers:

  • Crawler 1 - URL Extractor. This extracts all detail page URLs from the top-level listing pages and writes them to one or more files.
  • Crawler 2 - Fetch Details. This reads from the URL file(s) and extracts the item details.

This allowed me to fetch the URLs first and estimate the number of threads I might need for the second crawler.
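As a rough illustration of that hand-off, here is a minimal sketch in plain Python (no Scrapy needed) of reading the URL file and sizing the second stage. The file layout (one URL per line), the function names, and the `urls_per_worker` and `max_workers` defaults are all assumptions for illustration, not the values I actually used.

```python
import math
from pathlib import Path


def load_urls(url_file):
    """Read the URL file written by the first crawler (assumed: one URL per line).

    Blank lines and duplicate URLs are dropped; first-seen order is kept.
    """
    seen = set()
    urls = []
    for line in Path(url_file).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def estimate_workers(url_count, urls_per_worker=500, max_workers=16):
    """Hypothetical sizing rule: one worker per `urls_per_worker` URLs, capped."""
    if url_count == 0:
        return 0
    return min(max_workers, math.ceil(url_count / urls_per_worker))


def split_batches(urls, workers):
    """Deal the URLs round-robin so every worker gets a near-equal share."""
    return [urls[i::workers] for i in range(workers)]
```

Each batch could then be handed to a separate run of the second crawler; how the runs are actually launched depends on your setup.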

Since each crawler worked on a specific page format, there were quite a few functions I could reuse.
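To give an idea of what that reuse can look like, here is a small sketch of shared helpers that any format-specific parse callback could call. The helper names and the item fields (`url`, `title`, `image`) are hypothetical, chosen to match the photo-site example in the question.

```python
import re
from urllib.parse import urljoin


def clean_text(raw):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def absolute_url(page_url, href):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(page_url, href)


def build_item(page_url, title, image_href):
    """Assemble one item the same way regardless of which page format it came from."""
    return {
        "url": page_url,
        "title": clean_text(title),
        "image": absolute_url(page_url, image_href),
    }
```

Only the selectors that pull `title` and `image_href` out of the page need to differ per format; everything after that point is shared.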

Upvotes: 1
