Reputation: 706
I'm working on a web crawler (using scrapy) that uses 2 different spiders:
Everything works nicely so far, but website A contains links to other, "ordinary" websites that should be scraped too (using spider 1). Is there a Scrapy way to pass the request to spider 1?
Solutions I thought about:
Is there a better way?
Upvotes: 1
Views: 126
Reputation: 1049
I ran into a similar case: one spider retrieved the URLs from a first page, and a second spider was called from there to process them.
I don't know what your control flow looks like, but depending on it I would simply call the first spider either just in time, whenever a new URL is scraped, or once all possible URLs have been scraped.
Can spider 2 also retrieve URLs for the very same website? In that case, I would store all the URLs in a dict, with one list per spider, and repeat the process until there are no new elements left in the lists to explore. In my opinion that is the better approach because it is more flexible.
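The loop over per-spider lists could be sketched as below. This is only an illustration in plain Python, not Scrapy code: `SPECIAL_HOST`, `route`, `crawl_all` and the `fetch_links` callback are hypothetical names, and `fetch_links` stands in for "run the given spider on one URL and return the links it found".

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical host name for "website A"; everything else goes to spider 1.
SPECIAL_HOST = "website-a.example"


def route(url):
    """Return the name of the spider that should handle this URL."""
    return "spider2" if urlparse(url).netloc == SPECIAL_HOST else "spider1"


def crawl_all(start_urls, fetch_links):
    """Process URLs until neither spider has anything left to explore.

    fetch_links(spider_name, url) is an assumed callback that returns
    the URLs discovered on that page.
    """
    pending = {"spider1": deque(), "spider2": deque()}
    seen = set()

    def enqueue(url):
        # Skip URLs that were already queued once.
        if url not in seen:
            seen.add(url)
            pending[route(url)].append(url)

    for url in start_urls:
        enqueue(url)

    visited = []
    # Keep cycling while at least one queue is non-empty.
    while any(pending.values()):
        for name, queue in pending.items():
            while queue:
                url = queue.popleft()
                visited.append((name, url))
                for link in fetch_links(name, url):
                    enqueue(link)
    return visited
```

Each spider drains its own queue in one go, so any per-spider setup cost is paid once per round rather than once per URL.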
Calling just in time might be OK, but depending on your flow it could hurt performance, since repeated calls to the same functions will probably waste a lot of time on initialization.
You might also want to make the analysis functions independent of the spiders so that both can use them as needed. If your code is very long and complicated, this could help make it lighter and clearer. I know that is not always possible, but it might be worth a try, and you might end up with more efficient code.
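As a rough illustration of that last point, a helper kept outside both spiders might look like this. It uses only the standard library; `extract_links` and `_LinkCollector` are made-up names, and link extraction is just an assumed example of an "analytical function".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links in html, resolved against base_url."""
    collector = _LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]
```

Either spider's callback could then call something like `extract_links(response.text, response.url)` instead of carrying its own copy of the logic.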
Upvotes: 1