Reputation: 5275
I am unable to crawl data; I get a 504 Gateway Time-out error. I tried both bypass methods, rotating the User-Agent and rotating proxies, but neither helps me crawl the data.
I used scrapy-proxy-pool for the proxy method and scrapy-user-agents for the User-Agent method, but neither works.
I am still getting 504 Gateway Time-out.
My Scrapy spider:
import scrapy
import time
import random


class LaughfactorySpider(scrapy.Spider):
    name = "myspider"
    handle_httpstatus_list = [403, 504]
    start_urls = ["mywebsitewebsite"]

    def parse(self, response):
        # Random pause before parsing each response
        time.sleep(random.randint(0, 4))
        for site in response.xpath("//section[@class='test']/div/ul"):
            item = {
                # Relative XPath (".//") so the query is scoped to `site`,
                # not the whole document
                'name': site.xpath(".//li[@class='centr']//h2/span/text()").extract_first()
            }
            yield item
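For context on the 504s: Scrapy's built-in RetryMiddleware already retries 504 responses by default (504 is in the default RETRY_HTTP_CODES), and handle_httpstatus_list = [403, 504] only means the final failed response is handed to parse instead of being dropped. A minimal settings sketch for tuning retries, using standard Scrapy settings; the values here are just examples:

# settings.py -- retry tuning sketch; values are illustrative
RETRY_ENABLED = True   # enabled by default
RETRY_TIMES = 5        # default is 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # defaults already include 504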
settings.py
ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'formsubmit_getresult.pipelines.FormsubmitGetresultPipeline': 300,
}

# To enable the dynamic proxy (scrapy-proxy-pool)
PROXY_POOL_ENABLED = True

# Note: DOWNLOADER_MIDDLEWARES must be defined only once; a second
# assignment overrides the first, so the dynamic-proxy and dynamic
# User-Agent middlewares have to share a single dict.
DOWNLOADER_MIDDLEWARES = {
    # Dynamic proxy (scrapy-proxy-pool)
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # Dynamic User-Agent (scrapy-user-agents)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
Upvotes: 0
Views: 712
Reputation: 4378
You are not setting the User-Agent header correctly; that is why the website is giving you a 504. You need to add the User-Agent header to the first request and to all subsequent requests.
Try something like this:
import time
import random

import scrapy
from scrapy import Request


class LaughfactorySpider(scrapy.Spider):
    name = "myspider"
    handle_httpstatus_list = [403, 504]
    start_urls = ["mywebsitewebsite"]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    }

    def start_requests(self):
        # Attach the User-Agent header to the very first request
        yield Request(self.start_urls[0], headers=self.headers)

    def parse(self, response):
        time.sleep(random.randint(0, 4))
        for site in response.xpath("//section[@class='test']/div/ul"):
            item = {
                # Relative XPath so the query is scoped to `site`
                'name': site.xpath(".//li[@class='centr']//h2/span/text()").extract_first()
            }
            yield item
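Alternatively, if you do not want to pass headers= on every Request by hand, the same User-Agent can be set once in settings.py so every request, first and subsequent alike, carries it automatically. A minimal sketch; this assumes you are using Scrapy's default UserAgentMiddleware rather than scrapy-user-agents, which would otherwise override it:

# settings.py -- sketch: one global User-Agent for all requests
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/81.0.4044.122 Safari/537.36')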
Hope it helps
Upvotes: 2