maestromusica

Reputation: 706

Passing a request to a different spider

I'm working on a web crawler (using scrapy) that uses 2 different spiders:

  1. Very generic spider that can crawl (almost) any website using a bunch of heuristics to extract data.
  2. Specialized spider capable of crawling a particular website A that can't be crawled with a generic spider because of the website's peculiar structure (and that website has to be crawled).

Everything works nicely so far, but website A contains links to other, "ordinary" websites that should be scraped too (using spider 1). Is there a Scrapy way to pass the request to spider 1?

Solutions I thought about:

  1. Moving all functionality to spider 1. But that might get really messy: spider 1's code is already very long and complicated, and I'd like to keep this functionality separate if possible.
  2. Saving the links to a database, as was suggested in "Pass scraped URL's from one spider to another".

Is there a better way?

Upvotes: 1

Views: 126

Answers (1)

Ando Jurai

Reputation: 1049

I ran into a similar case, with one spider retrieving URLs from a first page and a second spider being called from there to process them.
I don't know what your control flow looks like, but depending on it, I would either call the first spider just in time whenever a new URL is scraped, or call it once after all possible URLs have been scraped.
Can spider 2 also retrieve URLs that belong to the very same website? In that case, I would store all URLs in a dict, with one list per spider, and repeat the process until there are no new elements left in the lists to explore. In my opinion that is the better option, as it is more flexible.
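The "dict of lists, repeat until nothing is left" idea could be sketched roughly as below. This is plain Python, not Scrapy API: `SPECIAL_DOMAIN`, `pick_spider`, `crawl_all` and the `runners` callables are hypothetical names made up for illustration, and each runner stands in for "run this spider on a batch of URLs and report the links it found".

```python
from urllib.parse import urlparse

# Hypothetical domain of "website A"; replace with the real one.
SPECIAL_DOMAIN = "site-a.example"


def pick_spider(url):
    """Decide which spider should handle a URL (assumed rule: by hostname)."""
    host = urlparse(url).hostname or ""
    if host == SPECIAL_DOMAIN or host.endswith("." + SPECIAL_DOMAIN):
        return "special"
    return "generic"


def crawl_all(start_urls, runners):
    """Route URLs between spiders until no spider has pending work.

    runners maps a spider name ("special" / "generic") to a callable that
    takes a list of URLs and returns an iterable of newly discovered URLs.
    """
    pending = {name: [] for name in runners}  # one list of URLs per spider
    seen = set()                              # avoid crawling a URL twice

    def enqueue(url):
        if url not in seen:
            seen.add(url)
            pending[pick_spider(url)].append(url)

    for url in start_urls:
        enqueue(url)

    # Keep looping while any spider still has URLs left to explore.
    while any(pending.values()):
        for name, run in runners.items():
            batch, pending[name] = pending[name], []
            if not batch:
                continue
            # One call per batch, so per-spider setup cost is paid per batch
            # rather than per URL.
            for found in run(batch):
                enqueue(found)
    return seen
```

Handing each spider a whole batch at a time is a deliberate choice here: it keeps the routing logic outside both spiders and limits how often each one has to be started.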

Calling just in time might be OK, but depending on your flow it could hurt performance, since repeated calls to the same functions will probably waste a lot of time on initialization.

You might also want to make the analysis functions independent of the spiders, so that both can use them as you see fit. If your code is very long and complicated, this might help make it lighter and clearer. I know it is not always possible to do so, but it might be worth a try, and you might end up being more efficient at the code level.
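As a minimal sketch of that separation, the link-extraction logic below lives in a module-level function that both spiders call. The classes are plain Python stand-ins, not real `scrapy.Spider` subclasses, and `extract_links`, `GenericSpider`, `SiteASpider` and the `/print/` filter are all hypothetical names and rules chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Shared helper: return absolute URLs for all links in the page."""
    collector = _LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


class GenericSpider:
    """Stand-in for the generic spider; uses the shared helper as-is."""
    name = "generic"

    def parse(self, url, html):
        return {"url": url, "links": extract_links(html, url)}


class SiteASpider:
    """Stand-in for the specialized spider; adds a site-specific filter."""
    name = "site_a"

    def parse(self, url, html):
        links = extract_links(html, url)
        # Assumed site-specific rule: skip printer-friendly duplicates.
        return {"url": url, "links": [l for l in links if "/print/" not in l]}
```

With the helper outside both classes, the specialized spider only contains what is actually specific to website A, and the heuristics can be tested without starting a crawl.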

Upvotes: 1
