Having trouble accessing xpath attribute with scrapy

Question

I am currently trying to scrape the following url: http://www.bedbathandbeyond.com/store/product/dyson-dc59-motorhead-cordless-vacuum/1042997979?categoryId=10562

On this page, I want to extract the number of reviews listed. That is, I want to extract the number 693.

This is my current xpath:

sel.xpath('//*[@id="BVRRRatingSummaryLinkReadID"]/a/span/span')

It only returns an empty array. Can someone suggest a correct XPath?

alecxe · Accepted Answer

The initial page that Scrapy downloads does not contain the reviews. They are loaded and constructed afterwards through heavy use of JavaScript, so your XPath has nothing to match in the HTML that Scrapy sees, and that makes things more complicated.

Basically, your options are:

a high-level approach (for example, use a real browser with selenium). You can even combine Scrapy and Selenium.
a middle-level approach: scrapy + scrapyjs
a low-level approach (find out where the reviews are constructed and get them)

Here is a working example of the low-level approach: it parses the JavaScript response with slimit, decodes the embedded object with json, extracts the HTML from it, and parses that HTML with BeautifulSoup:

import json

from bs4 import BeautifulSoup
import requests
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

ID = 1042997979

url = 'http://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/{id}/reviews.djs?format=embeddedhtml&sort=submissionTime'.format(id=ID)

response = requests.get(url)

parser = Parser()
tree = parser.parse(response.content)
data = ""
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Object):
        data = json.loads(node.to_ecma())
        if "BVRRSourceID" in data:
            break

# The value under "BVRRSourceID" is an HTML fragment; name the parser explicitly.
soup = BeautifulSoup(data['BVRRSourceID'], 'html.parser')
print(soup.select('span.BVRRCount span.BVRRNumber')[0].text)

Prints 693.

To adapt the solution to Scrapy, you would need to make a request with Scrapy instead of requests, and parse the HTML with Scrapy instead of BeautifulSoup.
