Empty list with scrapy and Xpath

Question

I'm starting to use Scrapy and XPath to scrape some pages. I'm just trying simple things in IPython, and I get a response on some sites such as IMDB, but on others such as www.bbb.org I always get an empty list. This is what I'm doing:

scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787'

BBB Accreditation

A BBB Accredited Business since 02/12/2010

BBB has determined that Tom's Automotive meets BBB accreditation standards, which include a commitment to......"

The XPath of this paragraph is:

'//*[@id="business-accreditation-content"]/p[2]'

So I use:

data = response.xpath('//*[@id="business-accreditation-content"]/p[2]').extract()

But data is an empty list. I'm getting the XPath from Chrome and it works on other pages, but here I get nothing regardless of which part of the page I try.

alecxe · Accepted Answer

The website actually checks for the User-Agent header.

See what it returns if you don't specify it:

$ scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787'
In [1]: print(response.body)
Out[1]: 123

In [2]: response.xpath('//*[@id="business-accreditation-content"]/p[2]').extract()
Out[2]: []

Yes, that's right: the response body contains only 123 when the request arrives with an unexpected user agent.

Now with the header (note the specified -s command-line argument):

$ scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787' -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'
In [1]: response.xpath('//*[@id="business-accreditation-content"]/p[2]').extract()
Out[1]: [u'BBB has determined that Tom\'s Automotive meets BBB accreditation standards, which include a commitment to make a good faith effort to resolve any consumer complaints. BBB Accredited Businesses pay a fee for accreditation review/monitoring and for support of BBB services to the public.']

This was an example from the shell. In a real Scrapy project, set the USER_AGENT project setting instead. Alternatively, you can rotate user agents with the help of this middleware: scrapy-fake-useragent.
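For the project-level route, a minimal sketch might look like the following. It assumes a standard Scrapy project layout with a settings.py module, and the user agent string is simply the one from the shell session above, not a value the site is known to require.

```python
# settings.py (project-level Scrapy settings; file location assumed
# from the default project layout)

# Sent as the User-Agent header on the requests the project makes.
# The value is only an example, copied from the shell session above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/46.0.2490.80 Safari/537.36"
)
```

With this in place, spiders in the project should send the browser-like header without needing the -s argument on each run.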
