Reputation: 434
I keep getting a 403 error when using Scrapy, even though I have the proper headers set. The website I am trying to scrape is https://steamdb.info/graph/.
My code:
def start_requests(self):
    headers = {
        "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Mobile Safari/537.36",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9,en-GB;q=0.8,ar;q=0.7",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "referer": "https://steamdb.info/graph/",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-origin",
        "x-requested-with": "XMLHttpRequest",
    }
    yield scrapy.Request(url='https://steamdb.info/graph', method='GET', headers=headers, callback=self.parse)

def parse(self, response):
    # stuff to do
    pass
Error:
2022-07-08 20:20:41 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://steamdb.info/graph> (referer: https://steamdb.info/graph/)
2022-07-08 20:20:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://steamdb.info/graph>: HTTP status code is not handled or not allowed
Upvotes: 2
Views: 2302
Reputation: 16587
CloudScraper worked for me:
pip install cloudscraper
Then add the middleware to your settings.py:
DOWNLOADER_MIDDLEWARES = {
    "YOUR_PATH.AntiBanMiddleware": 543,
}
Here is the AntiBanMiddleware:
import cloudscraper
from scrapy.http import HtmlResponse


class AntiBanMiddleware:
    cloudflare_scraper = cloudscraper.create_scraper()

    def process_response(self, request, response, spider):
        request_url = request.url
        response_status = response.status
        if response_status not in (403, 503):
            return response
        spider.logger.info("Cloudflare detected. Using cloudscraper on URL: %s", request_url)
        cflare_response = self.cloudflare_scraper.get(request_url)
        cflare_res_transformed = HtmlResponse(url=request_url, body=cflare_response.text, encoding='utf-8')
        return cflare_res_transformed
Upvotes: 1
Reputation: 434
I solved it. If a website is using Cloudflare, you can use undetected-chromedriver and use it as a Scrapy middleware.
Add this to middlewares.py:
import undetected_chromedriver as uc
from scrapy.http import HtmlResponse


class SeleniumMiddleWare(object):
    def __init__(self):
        path = "G:/Downloads/chromedriver.exe"
        options = uc.ChromeOptions()
        options.headless = True
        chrome_prefs = {}
        options.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}  # disable image loading
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        # self.driver = uc.Chrome(options=options)
        self.driver = uc.Chrome(options=options, use_subprocess=True, driver_executable_path=path)

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
        except Exception:
            pass
        content = self.driver.page_source
        self.driver.quit()  # note: the driver is closed after the first request
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        return response
settings.py:
DOWNLOADER_MIDDLEWARES = {
    'my_scraper.middlewares.SeleniumMiddleWare': 491,  # change my_scraper to your scraper's name
}
my_scraper.py:
import scrapy


class SeleniumSpider(scrapy.Spider):
    name = 'steamdb'
    allowed_domains = ['steamdb.info']
    start_urls = ['https://steamdb.info/graph/']

    def parse(self, response):
        yield {"title": response.css("h1::text").get()}
Upvotes: 2
Reputation: 16187
The website is under Cloudflare protection:
https://steamdb.info/graph/ is using Cloudflare CDN/Proxy!
https://steamdb.info/graph/ is using Cloudflare SSL!
It's working with cloudscraper, which is equivalent to the requests module but can handle Cloudflare protection.
import cloudscraper
scraper = cloudscraper.create_scraper(delay=10, browser={'custom': 'ScraperBot/1.0',})
url = 'https://steamdb.info/graph/'
req = scraper.get(url)
print(req)
Output:
<Response [200]>
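From there the fetched HTML in req.text can be parsed however you like; a minimal sketch using Scrapy's Selector (the h1 selector is just an assumed example):

from scrapy.selector import Selector

sel = Selector(text=req.text)
# hypothetical selector; adjust it to the element you actually need
print(sel.css('h1::text').get())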
Upvotes: 2
Reputation: 119
This is because the site does not exist: https://steamdb.info/graphs/ goes to a 404.
Thanks
Upvotes: 0