User New

Reputation: 404

Python Scrapy: collecting data from several pages into one item (dictionary)

I have a site to scrape. Its main page contains story teasers, so this page will be our starting page for parsing. My spider follows links from it and collects data about every story: author, rating, publication date, etc. This part the spider does correctly.

import scrapy
from scrapy.spiders import Spider
from sxtl.items import SxtlItem
from scrapy.http.request import Request


class SxtlSpider(Spider):
    name = "sxtl"

    start_urls = ['some_site']

    def parse(self, response):

        list_of_stories = response.xpath('//div[@id and @class="storyBox"]')

        item = SxtlItem()

        for i in list_of_stories:

            pre_rating = i.xpath(
                'div[@class="storyDetail"]/div[@class="storyDetailWrapper"]'
                '/div[@class="block rating_positive"]/span/text()').extract()
            rating = float("".join(pre_rating).replace("+", ""))

            link = "".join(i.xpath(
                'div[@class="wrapSLT"]/div[@class="titleStory"]'
                '/a/@href').extract())

            if rating > 6:
                yield Request(link, meta={'item': item},
                              callback=self.parse_story)
            else:
                break

    def parse_story(self, response):

        item = response.meta['item']

        number_of_pages = response.xpath(
            '//div[@class="pNavig"]/a[@href][last()-1]/text()').extract()

        if number_of_pages:
            item['number_of_pages'] = int("".join(number_of_pages))
        else:
            item['number_of_pages'] = 1

        item['date'] = "".join(response.xpath(
            '//span[@class="date"]/text()').extract()).strip()
        item['author'] = "".join(response.xpath(
            '//a[@class="author"]/text()').extract()).strip()
        item['text'] = response.xpath(
            '//div[@id="storyText"]/div[@itemprop="description"]/text()'
            ' | //div[@id="storyText"]/div[@itemprop="description"]'
            '/p/text()').extract()
        item['list_of_links'] = response.xpath(
            '//div[@class="pNavig"]/a[@href]/@href').extract()

        yield item

So, the data is gathered correctly, BUT we get ONLY THE FIRST page of every story. Every story has several pages (with links to the 2nd, 3rd, 4th page, sometimes up to 15 pages). That's where the problem arises. To fetch the 2nd page of every story, I replace yield item with this:

yield Request("".join(item['list_of_links'][0]), meta={'item':item}, \
                                                callback=self.get_text)


def get_text(self, response):

    item = response.meta['item']

    item['text'].extend(response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]'
        '/p/text()').extract())

    yield item

The spider collects the next (2nd) pages, BUT it joins them to the first page of ANY story. For example, the 2nd page of the 1st story may be added to the 4th story, the 2nd page of the 5th story to the 1st story, and so on.

Please help: how do I collect data into one item (one dictionary) when the data to be scraped is spread across several web pages? (In this case, how do I keep data from different items from getting mixed up?)

Thanks.

Upvotes: 1

Views: 1243

Answers (2)

User New

Reputation: 404

After many attempts and reading a whole bunch of documentation, I found the solution:

item = SxtlItem()

This item declaration should be moved from the parse method to the beginning of the parse_story method, and the line "item = response.meta['item']" in parse_story should be deleted. And, of course,

yield Request("".join(link), meta={'item':item}, callback=self.parse_story)

in "parse" should be changed to

yield Request("".join(link), callback=self.parse_story)

Why? Because the item was declared only once and all its fields were constantly being rewritten. While each story had only one page, it looked as if everything was OK and as if we had a "new" item each time. But when a story has several pages, the item gets overwritten in chaotic ways, and we receive chaotic results. In short: a new item should be created as many times as there are item objects we are going to save.

After moving "item = SxtlItem()" to the right place, everything works perfectly.
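
For reference, a minimal sketch of the corrected structure (using the same selectors as in the question; the rating filter and the remaining fields are elided for brevity):

def parse(self, response):
    for i in response.xpath('//div[@id and @class="storyBox"]'):
        link = "".join(i.xpath(
            'div[@class="wrapSLT"]/div[@class="titleStory"]'
            '/a/@href').extract())
        # no item is created here and none is passed via meta
        yield Request(link, callback=self.parse_story)

def parse_story(self, response):
    # a fresh item for every story, so pages cannot get mixed up
    item = SxtlItem()
    item['text'] = response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]'
        '/p/text()').extract()
    # ... fill the remaining fields as before, then follow the
    # pagination links with meta={'item': item} and extend
    # item['text'] in get_text
    yield item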

Upvotes: 1

Umair Ayub

Reputation: 21261

Non-technically speaking:

1) Scrape the story's 1st page.
2) Check whether it has more pages or not.
3) If not, just yield the item.
4) If it has a Next Page button/link, scrape that link and also pass the entire dictionary of data on to the next callback method.

def parse_story(self, response):

    item = response.meta['item']

    number_of_pages = response.xpath(
        '//div[@class="pNavig"]/a[@href][last()-1]/text()').extract()

    if number_of_pages:
        item['number_of_pages'] = int("".join(number_of_pages))
    else:
        item['number_of_pages'] = 1

    item['date'] = "".join(response.xpath(
        '//span[@class="date"]/text()').extract()).strip()
    item['author'] = "".join(response.xpath(
        '//a[@class="author"]/text()').extract()).strip()
    item['text'] = response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]'
        '/p/text()').extract()
    item['list_of_links'] = response.xpath(
        '//div[@class="pNavig"]/a[@href]/@href').extract()

    # if it has a NEXT PAGE button/link (nextPageURL must be
    # extracted from the page first; see the note below)
    if nextPageURL:
        yield Request(url=nextPageURL, callback=self.get_text,
                      meta={'item': item})
    else:
        # it has no more pages, so just yield the data
        yield item


def get_text(self, response):

    item = response.meta['item']

    # merge the text of this page into what we already have
    item['text'] = item['text'] + response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]'
        '/p/text()').extract()

    # now check again: if there is a NEXT PAGE, call this same
    # function again, otherwise finally yield the ITEM
    if nextPageURL:
        yield Request(url=nextPageURL, callback=self.get_text,
                      meta={'item': item})
    else:
        yield item
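
The answer sketches the flow but leaves the extraction of nextPageURL out. One way it might be obtained, assuming the "next page" link is the last anchor in the pNavig block (a hypothetical selector; the real one depends on the site's markup):

# hypothetical: take the last pagination link as the next page
next_href = response.xpath(
    '//div[@class="pNavig"]/a[@href][last()]/@href').extract_first()
# response.urljoin() resolves a relative href against the current page URL
nextPageURL = response.urljoin(next_href) if next_href else None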

Upvotes: 1
