Puneet Shrivas

Reputation: 48

Scrapy crawler on Heroku returning 503 Service Unavailable

I have a Scrapy crawler that scrapes data off a website and uploads it to a remote MongoDB server. I want to host it on Heroku so it can scrape automatically for long periods. I am using scrapy-user-agents to rotate between different user agents. When I run scrapy crawl <spider> locally on my PC, the spider runs correctly and writes the data to the MongoDB database.

However, when I deploy the project on Heroku, I get the following lines in my Heroku logs:

2020-12-22T12:50:21.132731+00:00 app[web.1]: 2020-12-22 12:50:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://indiankanoon.org/browse/> (failed 1 times): 503 Service Unavailable

2020-12-22T12:50:21.134186+00:00 app[web.1]: 2020-12-22 12:50:21 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36

(it fails the same way nine times in total, until:)

2020-12-22T12:50:23.594655+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://indiankanoon.org/browse/> (failed 9 times): 503 Service Unavailable

2020-12-22T12:50:23.599310+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://indiankanoon.org/browse/> (referer: None)

2020-12-22T12:50:23.701386+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://indiankanoon.org/browse/>: HTTP status code is not handled or not allowed

2020-12-22T12:50:23.714834+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.core.engine] INFO: Closing spider (finished)

In summary, the site serves the pages to my local IP address, but refuses the same requests when they come from Heroku. Can changing something in the settings.py file fix this?

My settings.py file:

BOT_NAME = 'indKanoon'

SPIDER_MODULES = ['indKanoon.spiders']
NEWSPIDER_MODULE = 'indKanoon.spiders'
MONGO_URI = ''
MONGO_DATABASE = 'casecounts'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
ITEM_PIPELINES = {
    'indKanoon.pipelines.IndkanoonPipeline': 300,
}
RETRY_ENABLED = True
RETRY_TIMES = 8
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
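For context, a pipeline wired to the MONGO_URI and MONGO_DATABASE settings above would typically follow the standard pymongo pattern from the Scrapy docs. A simplified sketch only (the collection name is illustrative, not taken from the project):

import pymongo

class IndkanoonPipeline:

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # 'cases' is an assumed collection name
        self.db['cases'].insert_one(dict(item))
        return item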

Upvotes: 1

Views: 1244

Answers (1)

Avinash Karhana

Reputation: 659

It is probably due to DDoS protection or IP blacklisting by the server you are trying to scrape.
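One quick way to confirm this is to let the spider receive the 503 instead of dropping it, and inspect the response body; anti-bot challenge pages usually identify themselves. A minimal sketch (the spider name is illustrative):

import scrapy

class ProbeSpider(scrapy.Spider):
    # Throwaway spider just for diagnosing the block
    name = 'probe'
    start_urls = ['https://indiankanoon.org/browse/']
    # Disable retries while probing, and let 503 responses
    # reach the callback instead of being filtered out by
    # HttpErrorMiddleware
    custom_settings = {'RETRY_ENABLED': False}
    handle_httpstatus_list = [503]

    def parse(self, response):
        self.logger.info('Status: %s', response.status)
        # The first few hundred characters usually reveal a
        # Cloudflare/anti-DDoS challenge page
        self.logger.info(response.text[:500])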

To overcome this, you can route your requests through proxies.

I would recommend a middleware such as scrapy-proxies. With it you can rotate proxies, filter out bad ones, or use a single proxy for all your requests. It also saves you the trouble of setting a proxy on every request.

The following is taken directly from the dev's GitHub README (GitHub link).

Install the scrapy-proxies library:

pip install scrapy_proxies

In your settings.py, add the following settings:

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"

Here you can change the retry count and choose between a single proxy or rotating proxies.

Then add your proxies to the list.txt file, one per line:

http://host1:port
http://username:password@host2:port
http://host3:port

With this setup, every request is sent through a proxy chosen at random from the list, so the rotation does not affect concurrency.

Other similar middlewares are also available, such as the following (a sample configuration for the first is sketched after the list):

scrapy-rotating-proxies

scrapy-proxies-tool
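For example, scrapy-rotating-proxies takes a plain list of proxies and detects bans for you. A minimal settings.py sketch, assuming it is installed with pip install scrapy-rotating-proxies (the proxy addresses are placeholders):

ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',
    'proxy2.example.com:8031',
]
# Or keep them in a file, one proxy per line:
# ROTATING_PROXY_LIST_PATH = '/path/to/proxies.txt'

DOWNLOADER_MIDDLEWARES = {
    # ... your existing middlewares ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}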

Upvotes: 2
