Scrapy - Scrape page when redirected to captcha page

Question

# -*- coding: utf-8 -*-
import scrapy

# Assumes the standard Scrapy project layout, with WayfairspiderItem defined in items.py.
from ..items import WayfairspiderItem


class WayfairSpider(scrapy.Spider):
    name = 'wayfair'
    # allowed_domains = ['wayfair.com']
    start_urls = ['https://www.wayfair.com/appliances/pdp/zline-kitchen-and-bath-30-4-cu-ft-freestanding-gas-range-zlkn2652.html']

    def parse(self, response):
        # Get the top-level product info block.
        items = response.css('.PdpLayoutVariationB-infoBlock')
        for product in items:
            item = WayfairspiderItem()

            # Get the price text.
            productPrice = product.css('.notranslate::text').getall()

            item['productPrice'] = productPrice
            yield item

The two images I posted show how I got the selectors that I'm using in my code. When I run this spider, I expect to get the price of the item from the page, but I'm getting empty results. I tested response.css('.notranslate').getall() in the Scrapy shell and the output was []. I would appreciate it if anyone could take a look and check my selectors!

Edit:

I believe my issue may actually be this:

When running my spider I get this:

2020-03-26 10:41:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to  from 
2020-03-26 10:41:41 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: None)

It looks like I'm being redirected to the captcha page. How can I get around this, or is this one of those unsolvable problems?

Here's what I've tried:

ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'

M Renlow · Accepted Answer

This is old, but I wanted to help in case someone stumbles on this later.

What's happening here is that Wayfair has figured out that the request is coming from a robot. To get around this, you need to update the user agent and headers in your project's settings (settings.py) so that the request imitates a browser.
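A minimal sketch of what that might look like in settings.py, assuming the default Scrapy project layout. USER_AGENT and DEFAULT_REQUEST_HEADERS are standard Scrapy settings; the header values below are illustrative assumptions, and nothing guarantees they are enough to get past Wayfair's bot detection.

```python
# settings.py (project-level Scrapy settings)

# Present a regular desktop browser instead of Scrapy's default user agent.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/80.0.3987.149 Safari/537.36"
)

# Headers sent with every request unless a request overrides them.
# These values are assumptions modeled on what a browser typically sends.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```

The user agent string is the one from the question; the extra headers are what the question's attempt did not yet include.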

This website has a pretty good general overview of what you need to do, and how to rotate these headers (which will likely be important as well, at scale):

https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

My advice would be to get it working once before you worry about rotating. You can do this in your Scrapy project's settings, where the USER_AGENT and DEFAULT_REQUEST_HEADERS options let you set a user agent and default header details. Once you update the settings and get things working for a single-page scrape, you can move on to scaling it up, which will likely require rotation.
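For the rotation step, one common approach is a small downloader middleware that picks a user agent per request. This is only a sketch under assumptions: the class name RotateUserAgentMiddleware, the USER_AGENTS list, and the module path in the comment are all hypothetical, and the linked article may do it differently.

```python
import random

# Hypothetical pool of browser user agents; extend it with real, current strings.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Overwrite the header before the request is sent.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# To enable it, something like this would go in settings.py
# (the module path is an assumption about your project name):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotateUserAgentMiddleware": 400,
# }
```

Rotating the user agent alone may not be enough at scale; sites that serve a captcha often look at IP address and request rate as well.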
