Difficulty formatting Scrapy output

Question

I'm a Python novice working on a Scrapy spider that is intended to retrieve all of the reviews from particular businesses on Yelp. This is my code so far, which mostly works:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
import re

# List of businesses to be crawled
RESTAURANTS = ['sixteen-chicago']

# Check number of reviews and create links to compensate for pagination
def createRestaurantPageLinks(self, response):
    reviewsPerPage = 40
    sel = Selector(response)
    totalReviews = int(sel.xpath('//div[@class="rating-info clearfix"]//span[@itemprop="reviewCount"]/text()').extract()[0].strip().split(' ')[0])
    pages = [Request(url=response.url + '?start=' + str(reviewsPerPage*(n+1)), callback=self.parse) for n in range(totalReviews/reviewsPerPage)]
    return pages

class YelpSpider(Spider):
    name = "yelp"
    allowed_domains = ["yelp.com"]
    start_urls = ['http://www.yelp.com/biz/%s' % s for s in RESTAURANTS]

    def parse(self, response):
        requests = []
        sel = Selector(response)
        reviews = sel.xpath('//div[@class="review-list"]')
        for review in reviews:
            venueName = sel.xpath('//meta[@property="og:title"]/@content').extract()
            reviewer = review.xpath('.//li[@class="user-name"]/a/text()').extract()
            reviewerLoc = review.xpath('.//li[@class="user-location"]/b/text()').extract()
            rating = review.xpath('.//div[@itemprop="review"]//meta[@itemprop="ratingValue"]/@content').extract()
            reviewDate = review.xpath('.//meta[@itemprop="datePublished"]/@content').extract()
            reviewText = review.xpath('.//p[@itemprop="description"]/text()').extract()
            print venueName, reviewer, reviewerLoc, reviewDate, rating, reviewText

        if response.url.find('?start=') == -1:
            requests += createRestaurantPageLinks(self, response)

        return requests

However, the output isn't what I expected. I anticipated something along the lines of this:

[u'venue name', u'reviewer', u'reviewer location', u'rating', u'review date', u'text of review']
[u'venue name', u'second reviewer', u'second reviewer location', u'second rating', u'second review date', u'second text of review']
[...]

But what I'm getting instead is every single instance of each variable on one row--all the reviewer names all alongside each other, all the review dates all alongside each other, etc. For example:

[u'Sharon C.', u'Steven N.', u'Michelle R.', u'Raven C.', u'Shelley M.', u'Kenneth S.', u'Liz L.', u'Allison B.', u'Valerie v.', u'Joy G.', u'Aleksandra W.', u'Jennifer J.', u'Emily M.', u'Danny G.', u'atima k.', u'Anna V.', u'Matt L.', u'Jay R.', u'Miss O.', u'Kathy O.', u'Happiness L.', u'Heidi J.', u'Maria A.', u'RD E.', u'Tom M.', u'Isaac G.', u'Michael P.', u'Mark P.', u'Stephanie P.', u'Jennifer L.', u'X X.', u'Erika H.', u'Ginger D.', u'Susan E.', u'Simone J.', u'Rick G.', u'Alia K.', u'Brent C.', u'Dan B.', u'Patricia H.']
[u'Hampshire, IL', u'Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'Indian Head Park, IL', u'Evanston, IL', u'Chicago, IL', u'Chicago, IL', u'Clearwater, FL', u'Chicago, IL', u'Worth, IL', u'Chicago, IL', u'Indianapolis, IN', u'Halifax, Canada', u'Manhattan, NY', u'Chicago, IL', u'Chicago, IL', u'Wicker Park, Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'Evanston, IL', u'Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'San Diego, CA', u'Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'Chicago, IL', u'Evanston, IL', u'Chicago, IL', u'Lisle, IL', u'Chicago, IL', u'Chicago, IL', u'Winnetka, IL', u'Torrance, CA']
[...]

I've tried exporting as items, but I end up with the same sort of result. I believe I may need some sort of serialization or something similar to get what I want, but I've reached the end of my know-how. Any help would be very much appreciated!

J L · Accepted Answer

The script looks good, except for one thing: reviews is pointing to a div that is a wrapper for all of the reviews on the page, rather than to each individual review. Thus, when Scrapy goes looking for //div[@class="review-list"], it gets back a single element that contains every review on the page. When it goes into the for loop, it only has one item to iterate over. That one item contains all of the reviews on the page, so trying to get .//li[@class="user-name"]/a/text(), for example, ends up giving you every reviewer for the page all at once.

If you change reviews = sel.xpath('//div[@class="review-list"]') to reviews = sel.xpath('//div[@class="review review-with-no-actions"]'), you'll see what I mean (just from looking at the Yelp page for Sixteen, I can see that each individual review is wrapped in a div with class review review-with-no-actions). With that change, reviews in your script becomes a list with one entry per review, rather than a single entry holding all of the reviews. The for loop now has a bunch of individual reviews to iterate over, so when it goes looking for .//li[@class="user-name"]/a/text(), for example, each iteration only finds one match (rather than all matches from the page).
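To make the difference concrete, here is a minimal sketch of the same idea. It makes several assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector (so it runs without Scrapy installed), the PAGE markup is a heavily simplified, hypothetical imitation of the Yelp page rather than the real HTML, and the rows helper is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a Yelp review page (NOT the real markup).
PAGE = """
<html><body>
  <div class="review-list">
    <div class="review review-with-no-actions">
      <ul><li class="user-name"><a>Sharon C.</a></li>
          <li class="user-location"><b>Hampshire, IL</b></li></ul>
    </div>
    <div class="review review-with-no-actions">
      <ul><li class="user-name"><a>Steven N.</a></li>
          <li class="user-location"><b>Chicago, IL</b></li></ul>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)


def rows(container_class):
    """Return one (names, locations) pair per div whose class equals container_class."""
    result = []
    for node in root.findall(".//div[@class='%s']" % container_class):
        # Relative paths: only search inside the current container node.
        names = [a.text for a in node.findall(".//li[@class='user-name']/a")]
        locations = [b.text for b in node.findall(".//li[@class='user-location']/b")]
        result.append((names, locations))
    return result


# Looping over the wrapper: a single iteration that sees every review at once.
wrapper_rows = rows("review-list")
# Looping over the individual review divs: one iteration per review.
review_rows = rows("review review-with-no-actions")

print(wrapper_rows)
print(review_rows)
```

With the wrapper class, the loop body runs once and each list holds every reviewer on the page, which matches the output in the question. With the per-review class, the loop body runs once per review and each list holds a single value. Note that @class="..." compares the whole attribute string, so this only matches divs whose class attribute is exactly that text.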

Edit: tl;dr: I think the problem isn't the logic of the code, but rather which element of Yelp's review page the XPath in reviews was pointing at.
