Reputation: 13
I have written the following spider for scraping the webmd site for patient reviews
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class MySpider(BaseSpider):
name = "webmd"
allowed_domains = ["webmd.com"]
start_urls = ["http://www.webmd.com/drugs/drugreview-92884-Boniva"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
titles = hxs.select("//p")
title = titles.select("//p[contains(@class, 'comment')and contains(@style, 'display:none')]/text()").extract()
print(title)
Executing this code gives me desired output but with a lot of duplication i.e. the same comments are repeated for at least 10 times. Help me to solve this issue.
Upvotes: 1
Views: 890
Reputation: 1549
You can rewrite your spider code like this:
import scrapy
# Your Items
class ReviewItem(scrapy.Item):
review = scrapy.Field()
class WebmdSpider(scrapy.Spider):
name = "webmd"
allowed_domains = ["webmd.com"]
start_urls = ['http://www.webmd.com/drugs/drugreview-92884-Boniva']
def parse(self, response):
titles = response.xpath('//p[contains(@id, "Full")]')
for title in titles:
item = ReviewItem()
item['review'] = title.xpath('text()').extract_first()
yield item
# Checks if there is a next page link, and keeping parsing if True
next_page = response.xpath('(//a[contains(., "Next")])[1]/@href').extract_first()
if next_page:
yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
It selects only full customer reviews without duplicate and saves them in Scrapy Items.
Note: instead of HtmlXPathSelector
you can use more convenient shortcut response
. Also, I change deprecated scrapy.BaseSpider
to scrapy.Spider
.
For saving reviews to a csv format you can simply use Scrapy Feed exports and type in console scrapy crawl webmd -o reviews.csv
.
Upvotes: 3
Reputation: 5648
You can use sets
to get unique comments. I hope you know that the selector returns the result as a list
so if you use sets then you ll get only unique results. So
def parse(self,response):
hxs = HtmlXPathSelector(response)
titles = hxs.select("//p")
title = set(titles.select("//p[contains(@class, 'comment')and contains(@style, 'display:none')]/text()").extract())
print (title) #this will have only unique results.
Upvotes: 2