Reputation: 410
I am currently trying to scrape the following url: http://www.bedbathandbeyond.com/store/product/dyson-dc59-motorhead-cordless-vacuum/1042997979?categoryId=10562
On this page, I want to extract the number of reviews listed. That is, I want to extract the number 693.
This is my current xpath:
sel.xpath('//*[@id="BVRRRatingSummaryLinkReadID"]/a/span/span')
It only returns an empty list. Can someone suggest a correct xpath?
Upvotes: 1
Views: 230
Reputation: 474201
The initial page you get with Scrapy contains no reviews. The reviews are loaded and built by javascript after the page loads, which makes things more complicated.
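To see why the xpath comes back empty, here is a minimal sketch using only the standard library. Both HTML snippets are hypothetical stand-ins (the real page markup is not reproduced here), and `review_nodes` is a made-up helper name. The point is only that the same query matches nothing when the container has not yet been filled in by javascript.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: what the server sends before any javascript runs.
STATIC_HTML = """
<html><body>
  <div id="BVRRRatingSummaryLinkReadID"></div>
</body></html>
"""

# Hypothetical markup: the same container after the script has filled it in.
RENDERED_HTML = """
<html><body>
  <div id="BVRRRatingSummaryLinkReadID">
    <a href="#"><span class="BVRRCount"><span class="BVRRNumber">693</span></span></a>
  </div>
</body></html>
"""

# Same shape as the xpath from the question, in ElementTree's XPath subset.
QUERY = ".//*[@id='BVRRRatingSummaryLinkReadID']/a/span/span"


def review_nodes(html):
    """Return the elements matched by QUERY in a well-formed HTML string."""
    return ET.fromstring(html.strip()).findall(QUERY)


static_matches = review_nodes(STATIC_HTML)
rendered_matches = review_nodes(RENDERED_HTML)

print(len(static_matches))                       # container is empty, no match
print([node.text for node in rendered_matches])  # the span exists only after rendering
```

So the xpath itself is not necessarily wrong. It is being run against HTML in which the target elements do not exist yet.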
Basically, your options are:
selenium, which drives a real browser so the javascript actually runs. You can even combine Scrapy and Selenium.
scrapy + scrapyjs
Here is a working example of the low-level approach: parse the javascript code with slimit and json, extract the HTML from it, and parse that HTML with BeautifulSoup:
import json

from bs4 import BeautifulSoup
import requests
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

ID = 1042997979
url = 'http://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/{id}/reviews.djs?format=embeddedhtml&sort=submissionTime'.format(id=ID)

# the response body is javascript, not HTML
response = requests.get(url)

parser = Parser()
tree = parser.parse(response.content)

# walk the syntax tree until we find the object literal holding the reviews HTML
data = ""
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Object):
        data = json.loads(node.to_ecma())
        if "BVRRSourceID" in data:
            break

# the value under "BVRRSourceID" is an HTML fragment
soup = BeautifulSoup(data['BVRRSourceID'])
print(soup.select('span.BVRRCount span.BVRRNumber')[0].text)
Prints 693.
To adapt the solution to Scrapy, you would need to make the request with Scrapy instead of requests, and parse the HTML with Scrapy instead of BeautifulSoup.
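If you would rather not depend on slimit inside a spider, one possible alternative is sketched below. It is a different technique from the one in the answer: it scans the javascript text for a JSON-compatible object literal with the standard library's `json.JSONDecoder.raw_decode`, then pulls the count out with a regular expression. The function names and `SAMPLE_JS` are hypothetical, and the sketch assumes the object holding `BVRRSourceID` is valid JSON in the raw response, which has not been verified against the real endpoint.

```python
import json
import re


def find_json_object_with_key(js_source, key):
    """Return the first JSON object literal in js_source that contains key, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", js_source):
        try:
            # Try to decode a JSON value starting at this opening brace.
            obj, _end = decoder.raw_decode(js_source, match.start())
        except ValueError:
            continue  # this brace does not start valid JSON, keep scanning
        if isinstance(obj, dict) and key in obj:
            return obj
    return None


def extract_review_count(js_source):
    """Return the review count embedded in the javascript payload, or None."""
    data = find_json_object_with_key(js_source, "BVRRSourceID")
    if data is None:
        return None
    html = data["BVRRSourceID"]
    # Regex stand-in for the 'span.BVRRCount span.BVRRNumber' selector.
    found = re.search(r'class="BVRRNumber"[^>]*>\s*([\d,]+)', html)
    if found is None:
        return None
    return int(found.group(1).replace(",", ""))


# Hypothetical payload, shaped like the response described in the answer.
SAMPLE_JS = (
    'var materials = {"BVRRSourceID": "<div><span class=\\"BVRRCount\\">'
    '<span class=\\"BVRRNumber\\">693</span> reviews</span></div>"};\n'
    'callback(materials);'
)

count = extract_review_count(SAMPLE_JS)
print(count)
```

In a Scrapy callback you would pass `response.text` to `extract_review_count`. Instead of the regular expression, you could also feed the extracted HTML to `scrapy.Selector(text=html)` and reuse the `span.BVRRCount span.BVRRNumber` selector from the answer.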
Upvotes: 4