The Other Guy

Reputation: 576

Crawl and monitor +1000 websites

I need help defining an architecture for a tool that will scrape more than 1,000 large websites daily, checking for new updates.

I'm planning to use Scrapy for this project.

Thanks!

Upvotes: 2

Views: 3803

Answers (2)

Shane Evans

Reputation: 2254

Scrapy is an excellent choice for this project. See the documentation on broad crawls for specific advice on crawling many (millions of) websites; with only 1000 websites it's less important. You should use a single project and a single spider - don't generate a separate project or spider per website! Either don't define an allowed_domains attribute, or make sure it's limited to the set of domains currently being crawled. You may want to split the domains so that each process only crawls a subset, allowing you to parallelize the crawl.
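As a rough sketch of that idea (the domain names, setting values and spider name here are placeholders to tune, not a definitive setup), a single spider could take its subset of domains as a command-line argument and apply a few broad-crawl oriented settings:

```python
# updates_spider.py -- one spider for all sites; each process gets a subset of
# domains, e.g.:  scrapy crawl updates -a domains=example.com,example.org
import scrapy


class UpdatesSpider(scrapy.Spider):
    name = "updates"

    # a few broad-crawl oriented settings; the values are illustrative
    custom_settings = {
        "CONCURRENT_REQUESTS": 100,           # crawl many sites in parallel
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # stay polite to each site
        "COOKIES_ENABLED": False,             # broad crawls rarely need sessions
        "RETRY_ENABLED": False,               # failed pages get picked up next day
    }

    def __init__(self, domains="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # allowed_domains is limited to the subset this process is crawling
        self.allowed_domains = [d for d in domains.split(",") if d]
        self.start_urls = [f"https://{d}/" for d in self.allowed_domains]

    def parse(self, response):
        # placeholder: just emit the raw page for downstream processing
        yield {"url": response.url, "html": response.text}
```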

Your spider will need to follow all links within the current domain; here is an example spider that follows all links, in case it helps. I'm not sure what processing you'll want to do on the raw HTML. You may also want to limit the depth or the number of pages per site (e.g. using the depth middleware).
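Something along these lines (a hypothetical sketch using a CrawlSpider with the built-in depth and close-spider settings; the domain and the limits are placeholders):

```python
# follow_all_spider.py -- sketch of a spider that follows every in-domain link
# and caps how deep / how many pages it fetches per run
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowAllSpider(CrawlSpider):
    name = "follow_all"
    allowed_domains = ["example.com"]        # placeholder domain
    start_urls = ["https://example.com/"]

    custom_settings = {
        "DEPTH_LIMIT": 3,               # enforced by the depth spider middleware
        "CLOSESPIDER_PAGECOUNT": 500,   # optional cap on pages per run
    }

    rules = (
        # follow every link inside allowed_domains, hand each page to parse_page
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # store the raw html; deciding what counts as "new" happens downstream
        yield {"url": response.url, "html": response.text}
```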

Regarding revisiting websites, see the deltafetch middleware as an example of how to approach fetching only new URLs. Perhaps you can start with that and customize it.
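If you go with the separately packaged scrapy-deltafetch plugin, enabling it is roughly a settings change like the one below (names as documented in that plugin's README; verify them against the version you install):

```python
# settings.py -- skip item pages that were already fetched on a previous run
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
# DELTAFETCH_RESET = True  # set occasionally to force a full re-crawl
```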

Upvotes: 12

ChrisProsser

Reputation: 13088

I will be interested to see what other answers come up for this. I have done some web crawling/scraping with code I wrote myself, using urllib to fetch the HTML and then just searching it for what I need, but I haven't tried Scrapy yet.

I guess that to see if there are differences you may just need to compare the previous and new HTML pages, but you would need to either work out which changes to ignore (e.g. dates) or which specific changes you are looking for, unless there is an easier way to do this with Scrapy.
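One possible (purely illustrative) approach is to hash the HTML after stripping the most obviously volatile parts, and only treat a page as updated when the fingerprint changes:

```python
# change_check.py -- sketch: has a page "really" changed since the last crawl?
import hashlib
import re


def content_fingerprint(html: str) -> str:
    # drop script/style blocks and html comments, which often carry
    # timestamps, counters and ad markup that change on every request
    cleaned = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html,
                     flags=re.DOTALL | re.IGNORECASE)
    cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)
    # collapse whitespace so pure reformatting does not count as a change
    cleaned = re.sub(r"\s+", " ", cleaned)
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def has_changed(old_html: str, new_html: str) -> bool:
    return content_fingerprint(old_html) != content_fingerprint(new_html)
```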

On the storage front you could either keep the HTML data in the file system or write it to a database as strings. A local database like SQLite should work fine for this, but there are many other options.
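For example, a minimal sketch of keeping the latest snapshot per URL in SQLite (the table and function names are just illustrative):

```python
# store_pages.py -- sketch: keep the most recent html per url in sqlite
import sqlite3
import time

conn = sqlite3.connect("pages.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS pages (
           url        TEXT PRIMARY KEY,
           fetched_at REAL,
           html       TEXT
       )"""
)


def save_page(url: str, html: str) -> None:
    # one row per url, always holding the latest snapshot
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, fetched_at, html) VALUES (?, ?, ?)",
        (url, time.time(), html),
    )
    conn.commit()


def load_page(url: str):
    # returns the previously stored html, or None if the url is new
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None
```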

Finally, I would also advise you to check the terms of the sites you are planning to scrape, and to check each site's robots.txt file, as some sites give guidance on how frequently they are happy for web crawlers to visit them.
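The standard library's urllib.robotparser can handle the robots.txt side; a small sketch (the user agent string and URLs are placeholders, and if you do end up using Scrapy, its ROBOTSTXT_OBEY setting covers the allow/deny part for you):

```python
# robots_check.py -- sketch: honour robots.txt before crawling a site
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder domain
rp.read()

# check whether a specific page may be fetched by our (placeholder) user agent
if rp.can_fetch("MyCrawlerBot", "https://example.com/some/page"):
    print("allowed to fetch")

# crawl_delay() returns the Crawl-delay directive for that agent, or None
# if the site did not specify one
delay = rp.crawl_delay("MyCrawlerBot")
```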

Upvotes: 0
