Reputation: 48
I have a Scrapy crawler that scrapes data off a website and uploads the scraped data to a remote MongoDB server. I wanted to host it on Heroku so it can scrape automatically for a long time.
I am using scrapy-user-agents to rotate between different user agents.
When I run scrapy crawl <spider> locally on my PC, the spider runs correctly and stores the scraped data in the MongoDB database.
However, when I deploy the project on Heroku, I get the following lines in my Heroku logs:
2020-12-22T12:50:21.132731+00:00 app[web.1]: 2020-12-22 12:50:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://indiankanoon.org/browse/> (failed 1 times): 503 Service Unavailable
2020-12-22T12:50:21.134186+00:00 app[web.1]: 2020-12-22 12:50:21 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36
(it fails the same way 9 times, until:)
2020-12-22T12:50:23.594655+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://indiankanoon.org/browse/> (failed 9 times): 503 Service Unavailable
2020-12-22T12:50:23.599310+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://indiankanoon.org/browse/> (referer: None)
2020-12-22T12:50:23.701386+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://indiankanoon.org/browse/>: HTTP status code is not handled or not allowed
2020-12-22T12:50:23.714834+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.core.engine] INFO: Closing spider (finished)
In summary, my local IP address is able to scrape the data, but when Heroku tries, it cannot. Can changing something in the settings.py file fix this?
My settings.py file:
BOT_NAME = 'indKanoon'
SPIDER_MODULES = ['indKanoon.spiders']
NEWSPIDER_MODULE = 'indKanoon.spiders'
MONGO_URI = ''
MONGO_DATABASE = 'casecounts'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
ITEM_PIPELINES = {
    'indKanoon.pipelines.IndkanoonPipeline': 300,
}
RETRY_ENABLED = True
RETRY_TIMES = 8
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
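For reference, the retry behaviour these settings configure can be sketched as a small stand-alone function (a simplified illustration of the idea, not Scrapy's actual RetryMiddleware code):

```python
# Simplified illustration of how RETRY_TIMES and RETRY_HTTP_CODES
# interact; the values mirror the settings.py above.
RETRY_TIMES = 8
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]

def should_retry(status_code, retry_count):
    """Return True if a response with this status should be retried.

    retry_count is how many retries have already been attempted.
    """
    return status_code in RETRY_HTTP_CODES and retry_count < RETRY_TIMES
```

With RETRY_TIMES = 8 the request is attempted once and then retried up to 8 more times, which matches the "failed 9 times" line in the Heroku log above.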
Upvotes: 1
Views: 1244
Reputation: 659
It is probably due to DDoS protection or IP blacklisting by the server you are trying to scrape.
To get around this you can use proxies.
I would recommend a middleware such as scrapy-proxies. With it you can rotate proxies, filter out bad ones, or use a single proxy for all your requests. It will also save you the trouble of setting up a proxy every time.
This is directly from the dev's GitHub README (Github Link).
Install the scrapy-proxies library:
pip install scrapy_proxies
In your settings.py, add the following settings:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'
# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
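To make the three PROXY_MODE values concrete, here is a simplified stand-alone sketch of the selection logic (an illustration of the behaviour described in the comments above, not scrapy_proxies' actual implementation):

```python
import random

def pick_proxy(proxies, mode, chosen=None, custom=None):
    """Illustrate the three PROXY_MODE values.

    mode 0: pick a random proxy from the list for every request
    mode 1: stick with one proxy (`chosen`) for all requests
    mode 2: always use the custom proxy from the settings
    """
    if mode == 0:
        return random.choice(proxies)
    if mode == 1:
        # if no proxy has been chosen yet, take the first one and keep it
        return chosen if chosen is not None else proxies[0]
    if mode == 2:
        return custom
    raise ValueError("PROXY_MODE must be 0, 1 or 2")
```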
Here you can change the retry times and set a single or rotating proxy.
Then add your proxies to a list.txt file like this:
http://host1:port
http://username:password@host2:port
http://host3:port
With this, all your requests will be sent through a proxy that is rotated randomly for every request, so it will not affect concurrency.
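If you want to sanity-check the entries in your list.txt before running the spider, a quick stdlib-only helper like this can catch malformed lines early (a hypothetical helper for illustration, not part of scrapy_proxies):

```python
from urllib.parse import urlparse

def is_valid_proxy(line):
    """Check a proxy-list entry looks like http://[user:pass@]host:port."""
    line = line.strip()
    if not line or line.startswith("#"):
        return False  # skip blank lines and comments
    parsed = urlparse(line)
    return (parsed.scheme in ("http", "https")
            and parsed.hostname is not None
            and parsed.port is not None)
```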
Other similar middlewares are also available.
Upvotes: 2