Reputation: 31
I need to scrape data for a list of domains given in an Excel file. The problem is that, for each domain, I need data from the original website (for example, https://www.lepetitballon.com) and data from SimilarTech (https://www.similartech.com/websites/lepetitballon.com).
I want both scrapes to run at the same time so that I can collect the results and format them once at the end; after that I'll just go on to the next domain.
In theory, should I just run two spiders asynchronously with Scrapy?
Upvotes: 1
Views: 959
Reputation: 1142
Scrapy uses the Twisted networking library for its internal networking, and it provides settings for controlling how many requests are handled concurrently.
Explained here: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
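As a rough sketch of what that looks like in a project's settings.py (the numbers below are illustrative assumptions, not recommendations or Scrapy's defaults):

```python
# settings.py
# Illustrative values only; tune them for your own machine and target sites.

# Upper bound on requests the downloader handles concurrently, across all domains.
CONCURRENT_REQUESTS = 32

# Upper bound on concurrent requests sent to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4
```

With settings like these, a single spider that requests both the original site and the SimilarTech page would fetch them concurrently without any extra code.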
Alternatively, you could use multiple spiders that are independent of each other, which is already explained in the Scrapy docs. This might be what you are looking for.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
As for efficiency, you could choose either option A (concurrent requests within one spider) or option B (multiple spiders in the same process); it really depends on your resources and requirements. Option A can be good when resources are limited and still gives decent speed, while option B can be faster at the cost of higher resource consumption.
Upvotes: 0
Reputation: 2224
Ideally you would want to keep spiders which scrape differently structured sites separate; that way your code will be a lot easier to maintain in the long run.
In theory, if for some reason you MUST parse them in the same spider, you could just collect the URLs you want to scrape and invoke a different parser callback method depending on each URL's base path. That being said, I personally cannot think of a reason why you would have to do that. Even if the sites had the same structure, you could just reuse your scrapy.Item classes.
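For what it's worth, the routing part of the single-spider idea could look roughly like this. It is a plain-Python sketch that does not import Scrapy; the callback names (parse_site, parse_similartech), the mapping, and both helper functions are hypothetical. It also assumes the original site lives at https://www.<domain>, as in the question's example. Inside a spider, the returned name would be resolved with getattr(self, name) and passed as the callback argument of scrapy.Request.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host name to the name of the parser method.
CALLBACKS_BY_HOST = {
    "www.similartech.com": "parse_similartech",
}
# Any other host is treated as the original website.
DEFAULT_CALLBACK = "parse_site"


def callback_name_for(url):
    """Return the name of the parser method that should handle this URL."""
    host = urlparse(url).netloc.lower()
    return CALLBACKS_BY_HOST.get(host, DEFAULT_CALLBACK)


def targets_for(domain):
    """Build (url, callback name) pairs for one domain from the spreadsheet."""
    urls = [
        f"https://www.{domain}",
        f"https://www.similartech.com/websites/{domain}",
    ]
    return [(url, callback_name_for(url)) for url in urls]
```

For example, targets_for("lepetitballon.com") pairs the shop URL with parse_site and the SimilarTech URL with parse_similartech.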
Upvotes: 1