Reputation: 404
I have a site to scrape. Its main page contains story teasers, so this page will be our starting point for parsing. My spider goes through it and collects data about every story - author, rating, publication date, etc. - and this part is done correctly by the spider.
import scrapy
from scrapy.spiders import Spider
from sxtl.items import SxtlItem
from scrapy.http.request import Request


class SxtlSpider(Spider):
    name = "sxtl"
    start_urls = ['some_site']

    def parse(self, response):
        list_of_stories = response.xpath('//div[@id and @class="storyBox"]')
        item = SxtlItem()
        for i in list_of_stories:
            pre_rating = i.xpath(
                'div[@class="storyDetail"]/div[@class="storyDetailWrapper"]'
                '/div[@class="block rating_positive"]/span/text()').extract()
            rating = float(("".join(pre_rating)).replace("+", ""))
            link = "".join(i.xpath(
                'div[@class="wrapSLT"]/div[@class="titleStory"]/a/@href'
            ).extract())
            if rating > 6:
                yield Request("".join(link), meta={'item': item},
                              callback=self.parse_story)
            else:
                break

    def parse_story(self, response):
        item = response.meta['item']
        number_of_pages = response.xpath(
            '//div[@class="pNavig"]/a[@href][last()-1]/text()').extract()
        if number_of_pages:
            item['number_of_pages'] = int("".join(number_of_pages))
        else:
            item['number_of_pages'] = 1
        item['date'] = "".join(response.xpath(
            '//span[@class="date"]/text()').extract()).strip()
        item['author'] = "".join(response.xpath(
            '//a[@class="author"]/text()').extract()).strip()
        item['text'] = response.xpath(
            '//div[@id="storyText"]/div[@itemprop="description"]/text()'
            ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
        ).extract()
        item['list_of_links'] = response.xpath(
            '//div[@class="pNavig"]/a[@href]/@href').extract()
        yield item
So, the data is gathered correctly, BUT we get ONLY THE FIRST page of every story. Every story has several pages (with links to the 2nd, 3rd, 4th page, sometimes up to 15 pages), and that's where the problem arises. I replace yield item with this (to get the 2nd page of every story):
yield Request("".join(item['list_of_links'][0]), meta={'item':item}, \
callback=self.get_text)
def get_text(self, response):
item = response.meta['item']
item['text'].extend(response.xpath('//div[@id="storyText"]/div\
[@itemprop="description"]/text() | //div[@id="storyText"]\
/div[@itemprop="description"]/p/text()').extract())
yield item
The spider collects the next (2nd) pages, BUT it joins them to the first page of some arbitrary story. For example, the 2nd page of the 1st story may be added to the 4th story, and the 2nd page of the 5th story to the 1st story, and so on.
Please help: how can the data be collected into one item (one dictionary) when it is spread across several web pages? (In this case: how do I keep data from different items from getting mixed with each other?)
Thanks.
Upvotes: 1
Views: 1243
Reputation: 404
After many attempts and reading a whole bunch of documentation, I found the solution:
item = SxtlItem()
This Item declaration should be moved from the parse function to the beginning of the parse_story function, and the line "item = response.meta['item']" in parse_story should be deleted. And, of course,
yield Request("".join(link), meta={'item':item}, callback=self.parse_story)
in "parse" should be changed to
yield Request("".join(link), callback=self.parse_story)
Why? Because the Item was declared only once, and all its fields were constantly being overwritten. While each story stayed on a single page, it looked as if everything was OK and as if we had a "new" Item each time. But when a story has several pages, this single Item gets overwritten in chaotic ways, and we receive chaotic results. In short: a new Item must be created as many times as there are item objects we intend to save.
After moving "item = SxtlItem()" to the right place, everything works perfectly.
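Here is a minimal sketch of the corrected flow, keeping the selectors from the question (the rating filter and the other field assignments are omitted for brevity; the check on list_of_links is an extra guard so that single-page stories are still yielded):

def parse(self, response):
    for i in response.xpath('//div[@id and @class="storyBox"]'):
        link = "".join(i.xpath(
            'div[@class="wrapSLT"]/div[@class="titleStory"]/a/@href'
        ).extract())
        # no shared item is created here any more
        yield Request(link, callback=self.parse_story)

def parse_story(self, response):
    # a fresh item for every story, so pages of different stories
    # can no longer be mixed together
    item = SxtlItem()
    item['text'] = response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
    ).extract()
    item['list_of_links'] = response.xpath(
        '//div[@class="pNavig"]/a[@href]/@href').extract()
    if item['list_of_links']:
        # more pages: pass this story's own item along the chain
        yield Request(item['list_of_links'][0], meta={'item': item},
                      callback=self.get_text)
    else:
        yield item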
Upvotes: 1
Reputation: 21261
Non-technically speaking:
1) Scrape the story's 1st page.
2) Check whether it has more pages or not.
3) If not, just yield the item.
4) If it has a Next Page button/link, scrape that link and also pass the entire dictionary of data on to the next callback method. (In the sketch below, the extraction of nextPageURL is a placeholder; the exact selector depends on the site's pagination markup.)
def parse_story(self, response):
    item = response.meta['item']
    number_of_pages = response.xpath(
        '//div[@class="pNavig"]/a[@href][last()-1]/text()').extract()
    if number_of_pages:
        item['number_of_pages'] = int("".join(number_of_pages))
    else:
        item['number_of_pages'] = 1
    item['date'] = "".join(response.xpath(
        '//span[@class="date"]/text()').extract()).strip()
    item['author'] = "".join(response.xpath(
        '//a[@class="author"]/text()').extract()).strip()
    item['text'] = response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
    ).extract()
    item['list_of_links'] = response.xpath(
        '//div[@class="pNavig"]/a[@href]/@href').extract()
    # if it has a NEXT PAGE link; placeholder selector, adjust it
    # to the site's actual pagination markup
    nextPageURL = response.xpath(
        '//div[@class="pNavig"]/a[contains(text(), "Next")]/@href'
    ).extract_first()
    if nextPageURL:
        yield Request(url=nextPageURL, callback=self.get_text,
                      meta={'item': item})
    else:
        # it has no more pages, so just yield the data
        yield item

def get_text(self, response):
    item = response.meta['item']
    # merge the text of this page into what was collected so far
    item['text'] = item['text'] + response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
    ).extract()
    # check again for a NEXT PAGE link and, if present, call this
    # same method again (same placeholder selector as above)
    nextPageURL = response.xpath(
        '//div[@class="pNavig"]/a[contains(text(), "Next")]/@href'
    ).extract_first()
    if nextPageURL:
        yield Request(url=nextPageURL, callback=self.get_text,
                      meta={'item': item})
    else:
        # no more pages, now finally yield the ITEM
        yield item
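A side note on passing data between callbacks: since Scrapy 1.7, the documented alternative to meta is cb_kwargs, which delivers the item to the next callback as a real keyword argument. A minimal sketch of the same chaining with it, reusing the selectors above (next_page stands for whatever pagination href you extracted):

yield response.follow(next_page, callback=self.get_text,
                      cb_kwargs={'item': item})

def get_text(self, response, item):
    # 'item' arrives as a keyword argument instead of via response.meta
    item['text'].extend(response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
    ).extract())
    yield item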
Upvotes: 1