chrisHG

Reputation: 80

Scrapy - Scrape page when redirected to captcha page

# -*- coding: utf-8 -*-
import scrapy

# Assumes the item class is defined in the project's items.py (standard Scrapy layout).
from ..items import WayfairspiderItem


class WayfairSpider(scrapy.Spider):
    name = 'wayfair'
    #allowed_domains = ['wayfair.com']
    start_urls = ['https://www.wayfair.com/appliances/pdp/zline-kitchen-and-bath-30-4-cu-ft-freestanding-gas-range-zlkn2652.html']

    def parse(self, response):
        # Get the top-level info block
        items = response.css('.PdpLayoutVariationB-infoBlock')
        for product in items:
            item = WayfairspiderItem()

            # Get the price
            productPrice = product.css('.notranslate::text').getall()

            item['productPrice'] = productPrice
            yield item

enter image description here

enter image description here

The two images I posted show how I got the selectors that I'm using in my code. When running this spider I expect to get the price of the item from the page, but I'm getting empty results. I tested response.css('.notranslate').getall() in the Scrapy shell and the output was []. I would appreciate it if anyone could take a look and check my selectors!

Edit:

I believe my issue may actually be this:

When running my spider I get this:

2020-03-26 10:41:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.wayfair.com/v/captcha/show?goto=https%3A%2F%2Fwww.wayfair.com%2Fappliances%2Fpdp%2F-zlkn2652.html%3F&px=1&captcha_status=0> from <GET https://www.wayfair.com/appliances/pdp/-zlkn2652.html>
2020-03-26 10:41:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wayfair.com/v/captcha/show?goto=https%3A%2F%2Fwww.wayfair.com%2Fappliances%2Fpdp%2F-zlkn2652.html%3F&px=1&captcha_status=0> (referer: None)

It looks like I'm being redirected to the captcha page. Is there a way to get around this, or is this one of those unsolvable problems?

Here's what I've tried:

ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'

Upvotes: 0

Views: 546

Answers (2)

M Renlow

Reputation: 36

This is old, but I wanted to help in case someone stumbles on this later.

What's happening here is that Wayfair has figured out that the request is coming from a robot. To get around this, you need to update the user agent and headers in your project's settings.py so that the request imitates a browser.
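As a rough sketch of what that could look like, the fragment below sets Scrapy's USER_AGENT and DEFAULT_REQUEST_HEADERS settings. The user agent string is the one from the question; the header values are illustrative assumptions, not values known to satisfy Wayfair's checks, and the question suggests a user agent alone may not be enough.

```python
# settings.py (sketch)

# Browser-like user agent, copied from the question.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
)

# Headers sent with every request. These values are assumptions chosen to
# resemble what a desktop browser sends; adjust them to match your own browser.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```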

This website has a pretty good general overview of what you need to do, and how to rotate these headers (which will likely be important as well, at scale):

https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

My advice would be to get it working once before you worry about rotating. You can do this in your Scrapy project's settings.py, which has a USER_AGENT setting and a DEFAULT_REQUEST_HEADERS setting. Once you have updated those and a single-page scrape works, you can move on to scaling it up, which will likely require rotation.
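If you do get to the rotation stage, one common approach is a small downloader middleware that picks a user agent per request. The following is only a sketch: the class name and the USER_AGENTS pool are hypothetical, and the strings in the pool are placeholders you would replace with current browser values.

```python
import random

# Hypothetical pool of user agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Overwrite the header before the request is downloaded.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

To enable it, you would add the class to DOWNLOADER_MIDDLEWARES in settings.py, for example {'myproject.middlewares.RotateUserAgentMiddleware': 400}, where myproject is a placeholder for your own project name. A priority below 500 should make it run before Scrapy's built-in user agent middleware, which only fills in the header when it is missing.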

Upvotes: 1

Ali Nazari

Reputation: 1438

Since your start URL is a product detail page, you don't have to iterate over items.

Try this:

def parse(self, response):
    item = WayfairspiderItem()
    # Query the response directly; there is no per-product loop on a detail page.
    productPrice = response.css('.StandardPriceBlock .notranslate::text').get()
    item['productPrice'] = productPrice
    yield item
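Keep in mind that this selector can only match on the real product page. If the request was redirected to the captcha URL shown in the question's log, .get() returns None. A minimal check, assuming the captcha path is the /v/captcha/ prefix seen in that log (the helper name is made up for illustration):

```python
from urllib.parse import urlparse

# Path prefix taken from the redirect in the question's log; an assumption
# about how the site routes its captcha page.
CAPTCHA_PATH_PREFIX = "/v/captcha/"


def is_captcha_page(url):
    """Return True if the URL points at the captcha page, not a product page."""
    return urlparse(url).path.startswith(CAPTCHA_PATH_PREFIX)
```

Inside parse you could call is_captcha_page(response.url) first and log a warning with self.logger.warning(...) instead of yielding an empty item.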

Upvotes: 0
