dcarlo56ave

Reputation: 253

Scrapy Crawler Process Setting

I have built multiple crawlers and want to run them simultaneously using CrawlerProcess. When building the spiders, I set them up to run a little slower and gave them a download delay. When running the spiders individually the settings work fine, but when I run all four spiders together the crawling is very fast and a few of the sites are kicking me off the network. What I would like to know is: why doesn't CrawlerProcess follow the settings, and if there is a way to make this happen, how can I achieve it?

Here's how I have it set up:

import os
import sys

from scrapy.crawler import CrawlerProcess

# all spiders write their items to a shared CSV feed
TMP_FILE = os.path.join(os.path.dirname(sys.modules['items'].__file__), 'tmp/items.csv')

process = CrawlerProcess({
    'FEED_FORMAT': 'csv',
    'FEED_URI': TMP_FILE,
})
process.crawl(Spider1)
process.crawl(Spider2)
process.crawl(Spider3)
process.crawl(Spider4)
process.start()

Upvotes: 1

Views: 1435

Answers (2)

eLRuLL

Reputation: 18799

This happens because each spider runs on its own, without knowing about the others.

Of course, all spiders use the same settings, but that's their only connection.

The sites are most likely complaining about the number of requests arriving, possibly from the same proxy/IP, so I would recommend either using a rotating proxy service or slowing the spiders down even more.

You can play with the following settings:
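
For example, the throttling settings DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN, and the AUTOTHROTTLE_* options can be passed to the CrawlerProcess itself so they apply to every spider it starts. A minimal sketch (the values are only illustrative and should be tuned per site):

from scrapy.crawler import CrawlerProcess

# settings passed here are shared by every spider started by this process
process = CrawlerProcess({
    'DOWNLOAD_DELAY': 5,                  # seconds between requests to the same site
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time per domain
    'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to server response times
})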

Upvotes: 0

dcarlo56ave

Reputation: 253

Fixed the issue by adding custom settings to each of my spiders. You can add this right below the start_urls list.

start_urls = ['https://www.example.com']

custom_settings = {
    'DOWNLOAD_DELAY': 8,
    'CONCURRENT_REQUESTS': 1,
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 5,
}
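
For context, a minimal spider sketch showing where custom_settings sits in the class (the spider name, URL, and parse logic are placeholders):

import scrapy

class Spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['https://www.example.com']

    # per-spider settings; these take priority over the settings
    # passed to CrawlerProcess for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 8,
        'CONCURRENT_REQUESTS': 1,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 5,
    }

    def parse(self, response):
        # placeholder callback
        yield {'url': response.url}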

Upvotes: 1
