Reputation: 80
# -*- coding: utf-8 -*-
import scrapy

# Assumption: WayfairspiderItem is defined in the project's items.py.
from ..items import WayfairspiderItem


class WayfairSpider(scrapy.Spider):
    name = 'wayfair'
    #allowed_domains = ['wayfair.com']
    start_urls = ['https://www.wayfair.com/appliances/pdp/zline-kitchen-and-bath-30-4-cu-ft-freestanding-gas-range-zlkn2652.html']

    def parse(self, response):
        # get top level item
        items = response.css('.PdpLayoutVariationB-infoBlock')
        for product in items:
            item = WayfairspiderItem()
            # get price
            productPrice = product.css('.notranslate::text').getall()
            item['productPrice'] = productPrice
            yield item
The two images I posted show how I got the selectors that I'm using in my code. When I run this spider I expect to get the price of the item from the page, but I'm getting empty results. I tested response.css('.notranslate').getall() in the Scrapy shell and the output was [].
I would appreciate it if anyone could take a look and check my selectors!
Edit:
I believe my issue may actually be this:
When running my spider I get this:
2020-03-26 10:41:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.wayfair.com/v/captcha/show?goto=https%3A%2F%2Fwww.wayfair.com%2Fappliances%2Fpdp%2F-zlkn2652.html%3F&px=1&captcha_status=0> from <GET https://www.wayfair.com/appliances/pdp/-zlkn2652.html>
2020-03-26 10:41:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wayfair.com/v/captcha/show?goto=https%3A%2F%2Fwww.wayfair.com%2Fappliances%2Fpdp%2F-zlkn2652.html%3F&px=1&captcha_status=0> (referer: None)
It looks like I'm being redirected to the captcha page. How can I get around this, or is this one of those unsolvable problems?
Here's what I've tried:
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
Upvotes: 0
Views: 546
Reputation: 36
This is old, but I wanted to help in case someone stumbles on this later.
What's happening here is that Wayfair has detected that the request is coming from a bot. To get around this, you need to update the user agent and headers in your project's settings.py so that the request imitates a browser.
This website has a pretty good general overview of what you need to do and how to rotate these headers (which will likely matter as well at scale):
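One way to confirm the redirect from inside the spider is to inspect the final response URL. The following is only a minimal sketch, assuming the captcha page keeps the /v/captcha/ path and the goto parameter seen in the question's log; the helper names is_captcha_url and original_target are hypothetical:

```python
from urllib.parse import urlparse, parse_qs


def is_captcha_url(url):
    """Return True if the URL looks like the captcha page from the log."""
    # Assumption: the captcha page lives under /v/captcha/ as in the log.
    return urlparse(url).path.startswith("/v/captcha/")


def original_target(url):
    """Return the page the captcha redirect was guarding, or None."""
    # Assumption: the original URL is carried in the "goto" query parameter.
    values = parse_qs(urlparse(url).query).get("goto")
    return values[0] if values else None
```

Inside parse, checking is_captcha_url(response.url) before running any selectors would make it obvious why the results are empty.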
https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/
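As a rough illustration of what rotation could look like in Scrapy, here is a minimal downloader middleware sketch. It is not taken from the linked article; the class name RotateUserAgentMiddleware and the USER_AGENTS pool are hypothetical, and the strings are example values only:

```python
import random

# Hypothetical pool of browser user agents; replace with current values.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Pick a user agent at random for this request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request.
        return None
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py, for example with a path such as 'myproject.middlewares.RotateUserAgentMiddleware' (the project name is an assumption). Rotation alone may still not be enough to avoid the captcha.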
My advice is to get it working once before you worry about rotating. You can do this in your Scrapy project's settings.py, which has a USER_AGENT setting and a DEFAULT_REQUEST_HEADERS setting for default headers. Once you have updated the settings and a single-page scrape works, you can move on to scaling up, which will likely require rotation.
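For a single page, the settings could look roughly like this. USER_AGENT and DEFAULT_REQUEST_HEADERS are real Scrapy settings, but the header values below are assumptions chosen to resemble a desktop browser, and there is no guarantee they are enough to get past Wayfair's bot detection:

```python
# settings.py (sketch)

# User agent copied from the question; any current browser string would do.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
)

# Headers sent with every request unless a request overrides them.
# Assumption: these values mimic what a desktop browser typically sends.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```

After changing these, re-run the spider and check the log: if the 302 redirect to the captcha URL is gone, the selectors can be debugged against the real product page.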
Upvotes: 1
Reputation: 1438
Since your start_url is a detail page, you don't have to iterate through items.
Try this:
def parse(self, response):
    item = WayfairspiderItem()
    # Query the response directly; there is no per-product loop on a detail page.
    productPrice = response.css('.StandardPriceBlock .notranslate::text').get()
    item['productPrice'] = productPrice
    yield item
Upvotes: 0