The Other Guy

Reputation: 576

Crawl and monitor +1000 websites

I need help defining an architecture for a tool that will scrape more than 1,000 large websites daily, checking for new updates.

I'm planning to use Scrapy for this project.

Thanks!

Upvotes: 2

Views: 3803

Answers (2)

Shane Evans

Reputation: 2254

Scrapy is an excellent choice for this project. See the documentation on broad crawls for specific advice on crawling many (millions of) websites; with only 1000 websites it's less important. You should use a single project and a single spider - don't generate a separate project or spider per website! Either don't define an allowed_domains attribute, or make sure it's limited to the set of domains currently being crawled. You may want to split the domains so that each process only crawls a subset, allowing you to parallelize the crawl.
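As a rough sketch of that idea (the domain names, setting values and spider name here are placeholders to tune, not a definitive setup), a single spider could take its subset of domains as a command-line argument and apply a few broad-crawl oriented settings:

```python
# updates_spider.py -- one spider for all sites; each process gets a subset of
# domains, e.g.:  scrapy crawl updates -a domains=example.com,example.org
import scrapy


class UpdatesSpider(scrapy.Spider):
    name = "updates"

    # a few broad-crawl oriented settings; the values are illustrative
    custom_settings = {
        "CONCURRENT_REQUESTS": 100,           # crawl many sites in parallel
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # stay polite to each site
        "COOKIES_ENABLED": False,             # broad crawls rarely need sessions
        "RETRY_ENABLED": False,               # failed pages get picked up next day
    }

    def __init__(self, domains="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # allowed_domains is limited to the subset this process is crawling
        self.allowed_domains = [d for d in domains.split(",") if d]
        self.start_urls = [f"https://{d}/" for d in self.allowed_domains]

    def parse(self, response):
        # placeholder: just emit the raw page for downstream processing
        yield {"url": response.url, "html": response.text}
```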

Your spider will need to follow all links within the current domain; here is an example spider that follows all links, in case it helps. I'm not sure what processing you'll want to do on the raw HTML. You may also want to limit the depth or the number of pages per site (e.g. using the depth middleware).
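Something along these lines (a hypothetical sketch using a CrawlSpider with the built-in depth and close-spider settings; the domain and the limits are placeholders):

```python
# follow_all_spider.py -- sketch of a spider that follows every in-domain link
# and caps how deep / how many pages it fetches per run
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowAllSpider(CrawlSpider):
    name = "follow_all"
    allowed_domains = ["example.com"]        # placeholder domain
    start_urls = ["https://example.com/"]

    custom_settings = {
        "DEPTH_LIMIT": 3,               # enforced by the depth spider middleware
        "CLOSESPIDER_PAGECOUNT": 500,   # optional cap on pages per run
    }

    rules = (
        # follow every link inside allowed_domains, hand each page to parse_page
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # store the raw html; deciding what counts as "new" happens downstream
        yield {"url": response.url, "html": response.text}
```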

Regarding revisiting websites, see the deltafetch middleware as an example of how to approach fetching only new URLs. Perhaps you can start with that and customize it.
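If you go with the separately packaged scrapy-deltafetch plugin, enabling it is roughly a settings change like the one below (names as documented in that plugin's README; verify them against the version you install):

```python
# settings.py -- skip item pages that were already fetched on a previous run
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
# DELTAFETCH_RESET = True  # set occasionally to force a full re-crawl
```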

Upvotes: 12

ChrisProsser

Reputation: 13088

I will be interested to see what other answers come up for this. I have done some web crawling/scraping with code I wrote myself, using urllib to fetch the HTML and then just searching it for what I need, but I haven't tried Scrapy yet.

I guess that to see if there are differences you may just need to compare the previous and new HTML pages, but you would need to either work out which changes to ignore (e.g. dates) or which specific changes you are looking for, unless there is an easier way to do this with Scrapy.
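One possible (purely illustrative) approach is to hash the HTML after stripping the most obviously volatile parts, and only treat a page as updated when the fingerprint changes:

```python
# change_check.py -- sketch: has a page "really" changed since the last crawl?
import hashlib
import re


def content_fingerprint(html: str) -> str:
    # drop script/style blocks and html comments, which often carry
    # timestamps, counters and ad markup that change on every request
    cleaned = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html,
                     flags=re.DOTALL | re.IGNORECASE)
    cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)
    # collapse whitespace so pure reformatting does not count as a change
    cleaned = re.sub(r"\s+", " ", cleaned)
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def has_changed(old_html: str, new_html: str) -> bool:
    return content_fingerprint(old_html) != content_fingerprint(new_html)
```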

On the storage front you could either keep the HTML data in the file system or write it to a database as strings. A local database like SQLite should work fine for this, but there are many other options.
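For example, a minimal sketch of keeping the latest snapshot per URL in SQLite (the table and function names are just illustrative):

```python
# store_pages.py -- sketch: keep the most recent html per url in sqlite
import sqlite3
import time

conn = sqlite3.connect("pages.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS pages (
           url        TEXT PRIMARY KEY,
           fetched_at REAL,
           html       TEXT
       )"""
)


def save_page(url: str, html: str) -> None:
    # one row per url, always holding the latest snapshot
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, fetched_at, html) VALUES (?, ?, ?)",
        (url, time.time(), html),
    )
    conn.commit()


def load_page(url: str):
    # returns the previously stored html, or None if the url is new
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None
```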

Finally, I would also advise you to check the terms of the sites you are planning to scrape, and to check each site's robots.txt file, as some sites give guidance on how frequently they are happy for web crawlers to visit them.
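The standard library's urllib.robotparser can handle the robots.txt side; a small sketch (the user agent string and URLs are placeholders, and if you do end up using Scrapy, its ROBOTSTXT_OBEY setting covers the allow/deny part for you):

```python
# robots_check.py -- sketch: honour robots.txt before crawling a site
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder domain
rp.read()

# check whether a specific page may be fetched by our (placeholder) user agent
if rp.can_fetch("MyCrawlerBot", "https://example.com/some/page"):
    print("allowed to fetch")

# crawl_delay() returns the Crawl-delay directive for that agent, or None
# if the site did not specify one
delay = rp.crawl_delay("MyCrawlerBot")
```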

Upvotes: 0
