Reputation: 570
This code returns results, but the output is not what I want. What is wrong with my XPath? Also, how do I iterate the rule in steps of +10? I always have trouble with these two things.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin

class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name_reviewer = scrapy.Field()
    date = scrapy.Field()
    model_name = scrapy.Field()
    rating = scrapy.Field()
    review = scrapy.Field()

class criticspider(CrawlSpider):
    name = "flip_review"
    allowed_domains = ["flipkart.com"]
    start_urls = ['http://www.flipkart.com/samsung-galaxy-s5/product-reviews/ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all']
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('.*\&start=.*',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse_start_url(self, response):
        sites = response.css('div.review-list div[review-id]')
        items = []
        model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')
        for site in sites:
            item = CompItem()
            item['model_name'] = model_name
            item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract())
            item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()
            item['title'] = site.xpath('.//div[contains(@class,"line fk-font-normal bmargin5 dark-gray")]/strong/text()').extract()
            item['review'] = site.xpath('.//span[contains(@class,"review-text")]/text()').extract()
            yield item
My output is:
{'date': [u'\n 31 Mar 2015 ', u'\n 23 Mar 2015 '],
'model_name': [u'\n Reviews of A & K 333 '],
'name_reviewer': [u'\n pradeep kumar', u'\n vikas agrawal']}
and I want one item per review, like this:
{model_name :xyz
name_reviewer :abc
date:38383
}
{model_name :xyz
name_reviewer :hfhd
date:9283
}
I think the problem is with my XPath.
Upvotes: 1
Views: 213
Reputation: 474141
First of all, your XPath expressions are very fragile in general.
The main problem with your approach is that site does not contain a review section, but it should. In other words, you are not iterating over review blocks on a page.
Also, the model name should be extracted outside of the loop, since it is the same for every review on a page. I would also use .re() to extract the model name out of the title, e.g. SAMSUNG GALAXY S5 out of REVIEWS OF SAMSUNG GALAXY S5.
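To show what that regular expression does on its own, here is a minimal sketch using only the standard re module rather than Scrapy's .re(). The helper name extract_model_name and the sample heading text are my own assumptions for illustration, not something taken from the page.

```python
import re

def extract_model_name(title_text):
    """Return the text after 'Reviews of' in a heading, or None if absent.

    Hypothetical helper: it mirrors the pattern passed to .re() in the
    spider, but runs on a plain string instead of a Scrapy selector.
    """
    # Strip first so that '$' anchors right after the last visible character.
    match = re.search(r'Reviews of (.*?)$', title_text.strip())
    return match.group(1).strip() if match else None

# Sample heading text with the kind of surrounding whitespace seen in the output.
print(extract_model_name(u'\n    Reviews of Samsung Galaxy S5   '))
```

Note that the pattern is case-sensitive, so it relies on the heading text in the HTML being written as "Reviews of ...".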
Here is the complete working code with fixes applied:
def parse_start_url(self, response):
    # One selector per review block on the page
    sites = response.css('div.review-list div[review-id]')
    # The model name is the same for every review, so extract it once
    model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')[0].strip()
    for site in sites:
        item = CompItem()
        item['model_name'] = model_name
        # The reviewer name is the first sibling preceding the date block
        item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract()).strip()
        item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()[0].strip()
        yield item
The XPath expressions are also made simpler. For the sake of an example, the review sections are identified by the CSS selector div.review-list div[review-id], which matches all div elements that have a review-id attribute anywhere under the div with the review-list class.
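To make the matching rule concrete, here is a rough sketch that reproduces it with the standard library's xml.etree.ElementTree instead of Scrapy. The markup is invented for the example and is not the real page, and the ElementTree path is only an approximation of the CSS selector: @class='review-list' is an exact string match, while the CSS class selector also matches elements that carry additional classes.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for the review page.
SAMPLE = """
<html><body>
  <div class="review-list">
    <div review-id="r1"><p>first review</p></div>
    <div class="ad">not a review</div>
    <div class="wrapper"><div review-id="r2"><p>second review</p></div></div>
  </div>
  <div review-id="outside">not under review-list</div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Rough equivalent of the CSS selector 'div.review-list div[review-id]':
# any div with a review-id attribute, at any depth under the review list.
blocks = root.findall(".//div[@class='review-list']//div[@review-id]")
review_ids = [block.get("review-id") for block in blocks]
print(review_ids)
```

The div without a review-id attribute and the one outside the review list are both skipped, which is the point of iterating over review blocks rather than over everything on the page.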
Also, note how name_reviewer is extracted. Since there are different kinds of users (some are represented as a profile link, others are not registered and are located in the span with the review-username class), I've taken a different approach: locating the review date and getting the first preceding sibling's text.
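The "date first, then the element just before it" idea can be sketched without Scrapy as well. This is an illustration only: the markup is a simplified guess at the structure, the helper names has_class and reviewer_and_date are hypothetical, and because ElementTree has no preceding-sibling axis the sibling lookup is done by hand. It also matches date as a whole class token, which is slightly stricter than contains(@class, "date").

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup: one reviewer is a profile link, one is a plain span.
SAMPLE = """
<div class="review-list">
  <div review-id="r1">
    <div class="line"><a class="load-user-widget" href="#">RISHABH GROVER</a></div>
    <div class="date line fk-font-small">10 Apr 2014</div>
  </div>
  <div review-id="r2">
    <div class="line"><span class="review-username">Joel</span></div>
    <div class="date line fk-font-small">24 May 2014</div>
  </div>
</div>
"""

def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in (element.get("class") or "").split()

def reviewer_and_date(review):
    """Find the date div, then read the text of the sibling right before it."""
    for parent in review.iter():
        previous = None
        for child in parent:
            if child.tag == "div" and has_class(child, "date") and previous is not None:
                # itertext() collects text from nested tags (a, span, ...).
                name = "".join(previous.itertext()).strip()
                date = (child.text or "").strip()
                return name, date
            previous = child
    return None, None

root = ET.fromstring(SAMPLE.strip())
results = [reviewer_and_date(review) for review in root.findall("./div[@review-id]")]
print(results)
```

Both reviewer variants are handled by the same code, because the lookup never depends on whether the name sits in a link or in a span.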
I'd like to point out that class names like line, fk-font-small, fk-font-11 etc. are layout-oriented classes and are, generally speaking, not a good choice to base your XPath expressions and CSS selectors on. Note which classes are used to locate elements in the answer: review-list, title, date. They are more data-oriented and a better choice for your locators.
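As a small illustration of why this matters, the sketch below compares a locator tied to the full class string with one that only looks for the date class. It uses the standard library's xml.etree.ElementTree; the two markup snippets and the fk-font-12 class are hypothetical, standing in for a layout change on the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup before and after a purely cosmetic redesign.
BEFORE = '<div><div class="date line fk-font-small">10 Apr 2014</div></div>'
AFTER = '<div><div class="date line fk-font-12">10 Apr 2014</div></div>'

def by_exact_class(root):
    """Fragile: depends on every layout class and on their order."""
    return root.findall(".//div[@class='date line fk-font-small']")

def by_class_token(root, name="date"):
    """More robust: only requires the data-oriented class to be present."""
    return [div for div in root.iter("div")
            if name in (div.get("class") or "").split()]

for label, markup in (("before", BEFORE), ("after", AFTER)):
    root = ET.fromstring(markup)
    print(label, len(by_exact_class(root)), len(by_class_token(root)))
```

The exact-match locator stops finding the date as soon as a font class changes, while the token-based one keeps working.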
Upvotes: 1
Reputation: 1712
This should help. The problem is with your XPath:
In [1]: data_list = []
In [2]: sites = response.xpath('//div[@class="review-list"]/div')
In [3]: for site in sites:
   ...:     data = {}
   ...:     data['name_reviewer'] = site.xpath('./div/div[@class="line"]/span[@class="fk-color-title fk-font-11 review-username"]/text()|./div/div[@class="line"]/a[@class="load-user-widget fk-underline"]/text()').extract()[0].strip()
   ...:     data['date'] = site.xpath('./div/div[@class="date line fk-font-small"]/text()').extract()[0].strip()
   ...:     data['model_name'] = response.xpath('//h1[@class="title"]/text()').extract()[0].strip()
   ...:     data_list.append(data)
In [4]: data_list
Out[4]:
[{'date': u'10 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'RISHABH GROVER'},
{'date': u'11 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Hemraj Chaudhari'},
{'date': u'28 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'RISHABH GROVER'},
{'date': u'27 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Debadutta Patnaik'},
{'date': u'24 May 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Joel'},
{'date': u'11 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Saswat Nayak'},
{'date': u'14 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Amit Thakor'},
{'date': u'28 May 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Nishchal Sharma'},
{'date': u'13 May 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'siddiq hassan'},
{'date': u'16 May 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Raja Shekhar'}]
Upvotes: 1