Scrapy not giving individual results of all the reviews of a phone?

Question

This code returns results, but the output is not what I want. What is wrong with my XPath? Also, how do I iterate the rule in steps of +10? I always have trouble with these two things.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name_reviewer = scrapy.Field()
    date = scrapy.Field()
    model_name = scrapy.Field()
    rating = scrapy.Field()
    review = scrapy.Field()



class criticspider(CrawlSpider):
    name = "flip_review"
    allowed_domains = ["flipkart.com"]

    start_urls = ['http://www.flipkart.com/samsung-galaxy-s5/product-reviews/ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all']
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('.*\&start=.*',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse_start_url(self, response):
        sites = response.css('div.review-list div[review-id]')
        items = []
        model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')
        for site in sites:
            item = CompItem()
            item['model_name'] = model_name
            item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract())
            item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()
            item['title'] = site.xpath('.//div[contains(@class,"line fk-font-normal bmargin5 dark-gray")]/strong/text()').extract()
            item['review'] = site.xpath('.//span[contains(@class,"review-text")]/text()').extract()
            yield item

My output is:

 {'date': [u'\n31 Mar 2015 ', u'\n23 Mar 2015 '],
  'model_name': [u'\nReviews of A & K 333 '],
  'name_reviewer': [u'\npradeep kumar', u'\nvikas agrawal']}

and I want my output to be:

{model_name :xyz
name_reviewer :abc
date:38383
}
{model_name :xyz
name_reviewer :hfhd
date:9283
}

I think the problem is with my XPath.

alecxe · Accepted Answer

First of all, your XPath expressions are very fragile in general.

The main problem with your approach is that each site should correspond to a single review block, but it does not. In other words, you are not iterating over the review blocks on the page, so every field collects the values of all reviews at once.

Also, the model name should be extracted outside the loop, since it is the same for every review on a page. I would also use .re() to extract the model name from the title, e.g. SAMSUNG GALAXY S5 out of REVIEWS OF SAMSUNG GALAXY S5.
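To see what that pattern captures, here is a minimal sketch that uses the standard-library re module instead of Scrapy's .re(). The helper name extract_model_name and the sample heading are assumptions made for illustration; the heading only mimics the whitespace seen in the question's output.

```python
import re


def extract_model_name(heading):
    """Return the text after 'Reviews of ', or None if the heading differs.

    extract_model_name is a hypothetical helper, not part of Scrapy.
    """
    match = re.search(r"Reviews of (.*?)$", heading)
    if match is None:
        return None
    # The captured group can carry trailing whitespace, so strip it.
    return match.group(1).strip()


# Assumed sample text: leading newline and trailing space, as in the question.
model_name = extract_model_name("\nReviews of Samsung Galaxy S5 ")
print(model_name)
```

Returning None for an unexpected heading avoids the IndexError that indexing an empty result list with [0] would raise.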

Here is the complete working code with fixes applied:

def parse_start_url(self, response):
    sites = response.css('div.review-list div[review-id]')

    model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')[0].strip()
    for site in sites:
        item = CompItem()
        item['model_name'] = model_name
        item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract()).strip()
        item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()[0].strip()
        yield item

The XPath expressions are also simpler now. For example, the review sections are located with the CSS selector div.review-list div[review-id], which matches every div element that has a review-id attribute anywhere under the div with the review-list class.
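The difference between a page-wide query and a per-review query can be shown without Scrapy. The sketch below swaps in the standard-library xml.etree.ElementTree so that it runs on its own; the markup is a simplified, hypothetical stand-in for the real page, and ElementTree compares the class attribute exactly, whereas a CSS class selector matches individual class names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup (an assumption, not the real page).
PAGE = """
<body>
  <h1 class="title">Reviews of Samsung Galaxy S5</h1>
  <div class="review-list">
    <div review-id="r1">
      <span class="review-username">abc</span>
      <div class="date">31 Mar 2015</div>
    </div>
    <div review-id="r2">
      <span class="review-username">hfhd</span>
      <div class="date">23 Mar 2015</div>
    </div>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# Page-wide query: all dates land in one list, like the output in the question.
all_dates = [d.text for d in root.findall(".//div[@class='date']")]

# Per-review query: select each review container, then search inside it only.
blocks = root.findall(".//div[@class='review-list']//div[@review-id]")
items = [
    {
        "review_id": block.get("review-id"),
        "date": block.find(".//div[@class='date']").text.strip(),
    }
    for block in blocks
]

print(all_dates)
print(items)
```

In a Scrapy callback the same idea applies: call the selector methods on each review block with a path that starts with .// so the search stays inside that block.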

Also, note how name_reviewer is extracted. Reviewers are marked up differently: some appear as a profile link, while unregistered ones are placed in a span with the review-username class. I therefore took a different approach: locate the review date and take the text of its first preceding sibling.
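The sibling trick can be sketched with the standard library as well. ElementTree has no preceding-sibling axis, so this version walks the children of a review block by index. It assumes the date element is a direct child of the block and matches "date" as a whole class name rather than as a substring; reviewer_name and both sample blocks are hypothetical.

```python
import xml.etree.ElementTree as ET


def reviewer_name(block):
    """Return the text of the element placed directly before the date element."""
    children = list(block)
    for index, child in enumerate(children):
        is_date = child.tag == "div" and "date" in child.get("class", "").split()
        if is_date and index > 0:
            # Join all nested text nodes, like //text() followed by ''.join().
            return "".join(children[index - 1].itertext()).strip()
    return None


# Assumed markup: a registered reviewer is rendered as a profile link...
registered = ET.fromstring(
    '<div review-id="r1">'
    '<a href="/profile/1">pradeep kumar</a>'
    '<div class="date"> 31 Mar 2015 </div>'
    "</div>"
)
# ...and an unregistered one as a span with the review-username class.
unregistered = ET.fromstring(
    '<div review-id="r2">'
    '<span class="review-username">vikas agrawal</span>'
    '<div class="date"> 23 Mar 2015 </div>'
    "</div>"
)

print(reviewer_name(registered))
print(reviewer_name(unregistered))
```

Both variants yield the name through the same code path, which is the point of anchoring on the date instead of on the reviewer markup.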

I'd like to point out that class names like line, fk-font-small, fk-font-11, etc. are layout-oriented and are, generally speaking, a poor basis for XPath expressions and CSS selectors. Note which classes are used to locate elements in the answer: review-list, title, date. They are more data-oriented and a better choice for your locators.
