Reputation: 375
I'd like to hear the diffrences between 3 different approaches for using Scrapy in order to crawl 1000 sites.
For example, I want to scrape 1000 photo sites, they all most has the same structure.Like have one kind of photo list page,and other kind of big photo page; but these list or photo desc page's HTML code will not all the same.
Another example,I want to scrape 1000 wordpress blog,Only bolg's article.
What are the diffrences, and which do you think is the right approach? Is there any other, better approach I've missed?
Upvotes: 0
Views: 342
Reputation: 141
I had 90 sites to pull from so it wasn't great option to create one crawler per site. The idea was to be able to run in parallel. Also i had split this to pack similar page formats in one place.
So I ended up with 2 crawlers:
This allowed me to fetch URLs first and estimate number of threads that i might need for second crawler.
Since each crawler was working on specific page format, there were quite a few functions I could reuse.
Upvotes: 1