Reputation: 5275
I am unable to crawl data; I get a 504 Gateway Time-out error. I tried both bypass methods, rotating the User-Agent and rotating proxies, but neither helps me crawl the data.
I used scrapy-proxy-pool for the proxy method and scrapy-user-agents for the User-Agent method, but neither works.
I am still getting 504 Gateway Time-out.
My Scrapy spider:
import scrapy
import time
import random


class LaughfactorySpider(scrapy.Spider):
    name = "myspider"
    handle_httpstatus_list = [403, 504]
    start_urls = ["mywebsitewebsite"]

    def parse(self, response):
        # Random pause before parsing each response
        time.sleep(random.randint(0, 4))
        for site in response.xpath("//section[@class='test']/div/ul"):
            item = {
                # Relative XPath (".//") so the query is scoped to `site`,
                # not the whole document
                'name': site.xpath(".//li[@class='centr']//h2/span/text()").extract_first()
            }
            yield item
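For context on the 504s: Scrapy's built-in RetryMiddleware already retries 504 responses by default (504 is in the default RETRY_HTTP_CODES), and handle_httpstatus_list = [403, 504] only means the final failed response is handed to parse instead of being dropped. A minimal settings sketch for tuning retries, using standard Scrapy settings; the values here are just examples:

# settings.py -- retry tuning sketch; values are illustrative
RETRY_ENABLED = True   # enabled by default
RETRY_TIMES = 5        # default is 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # defaults already include 504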
settings.py
ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'formsubmit_getresult.pipelines.FormsubmitGetresultPipeline': 300,
}

# To enable the dynamic proxy (scrapy-proxy-pool)
PROXY_POOL_ENABLED = True

# Note: DOWNLOADER_MIDDLEWARES must be defined only once; a second
# assignment overrides the first, so the dynamic-proxy and dynamic
# User-Agent middlewares have to share a single dict.
DOWNLOADER_MIDDLEWARES = {
    # Dynamic proxy (scrapy-proxy-pool)
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # Dynamic User-Agent (scrapy-user-agents)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
Upvotes: 0
Views: 712
Reputation: 4378
You are not setting the User-Agent header correctly; that is why the website is giving you a 504. You need to add the User-Agent header to the first request and to all subsequent requests.
Try something like this:
import time
import random

import scrapy
from scrapy import Request


class LaughfactorySpider(scrapy.Spider):
    name = "myspider"
    handle_httpstatus_list = [403, 504]
    start_urls = ["mywebsitewebsite"]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    }

    def start_requests(self):
        # Attach the User-Agent header to the very first request
        yield Request(self.start_urls[0], headers=self.headers)

    def parse(self, response):
        time.sleep(random.randint(0, 4))
        for site in response.xpath("//section[@class='test']/div/ul"):
            item = {
                # Relative XPath so the query is scoped to `site`
                'name': site.xpath(".//li[@class='centr']//h2/span/text()").extract_first()
            }
            yield item
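Alternatively, if you do not want to pass headers= on every Request by hand, the same User-Agent can be set once in settings.py so every request, first and subsequent alike, carries it automatically. A minimal sketch; this assumes you are using Scrapy's default UserAgentMiddleware rather than scrapy-user-agents, which would otherwise override it:

# settings.py -- sketch: one global User-Agent for all requests
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/81.0.4044.122 Safari/537.36')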
Hope it helps
Upvotes: 2