dcarlo56ave

Reputation: 253

Scrapy Crawler Process Setting

I have built multiple crawlers and want to run them simultaneously using CrawlerProcess. When building the spiders, I set them up to run a little slower and gave them a download delay. When running the spiders individually the settings work fine, but when I run all four spiders together the crawling is very fast and a few of the sites are kicking me off the network. What I would like to know is: why doesn't CrawlerProcess follow the settings, and if there is a way to make this happen, how can I achieve it?

Here's how I have it set up:

import os
import sys

from scrapy.crawler import CrawlerProcess

# all spiders write their items to a shared CSV feed
TMP_FILE = os.path.join(os.path.dirname(sys.modules['items'].__file__), 'tmp/items.csv')

process = CrawlerProcess({
    'FEED_FORMAT': 'csv',
    'FEED_URI': TMP_FILE,
})
process.crawl(Spider1)
process.crawl(Spider2)
process.crawl(Spider3)
process.crawl(Spider4)
process.start()

Upvotes: 1

Views: 1435

Answers (2)

eLRuLL

Reputation: 18799

This happens because each spider runs on its own, without knowing about the others.

Of course, all spiders use the same settings, but that's their only connection.

The sites are most likely complaining about the number of requests arriving, possibly from the same proxy/IP, so I would recommend either using a rotating proxy service or slowing the spiders down even more.

You can play with the following settings:
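
For example, the throttling settings DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN, and the AUTOTHROTTLE_* options can be passed to the CrawlerProcess itself so they apply to every spider it starts. A minimal sketch (the values are only illustrative and should be tuned per site):

from scrapy.crawler import CrawlerProcess

# settings passed here are shared by every spider started by this process
process = CrawlerProcess({
    'DOWNLOAD_DELAY': 5,                  # seconds between requests to the same site
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time per domain
    'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to server response times
})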

Upvotes: 0

dcarlo56ave

Reputation: 253

Fixed the issue by adding custom settings to each of my spiders. You can add this right below the start_urls list.

start_urls = ['https://www.example.com']

custom_settings = {
    'DOWNLOAD_DELAY': 8,
    'CONCURRENT_REQUESTS': 1,
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 5,
}
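
For context, a minimal spider sketch showing where custom_settings sits in the class (the spider name, URL, and parse logic are placeholders):

import scrapy

class Spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['https://www.example.com']

    # per-spider settings; these take priority over the settings
    # passed to CrawlerProcess for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 8,
        'CONCURRENT_REQUESTS': 1,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 5,
    }

    def parse(self, response):
        # placeholder callback
        yield {'url': response.url}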

Upvotes: 1
