Echchama Nayak

Reputation: 933

Making a spider restartable

My intention is to scrape a few URLs with a spider like the following:

import scrapy
from ..items import ContentsPageSFBItem

class BasicSpider(scrapy.Spider):
    name = "contentspage_sfb"
    #allowed_domains = ["web"]
    start_urls = [
        'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/',
        'https://www.safaribooksonline.com/library/view/cisa-certified-information/9780134677453/'
    ]

    def parse(self, response):
        item = ContentsPageSFBItem()

        #from scrapy.shell import inspect_response
        #inspect_response(response, self)

        content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()

        for content_item in content_items:
            item['content_item'] = content_item
            item['full_url'] = response.url
            item['title'] = response.xpath('//title[1]/text()').extract()

            yield item

I intend to use many more URLs, so I want to make the spider restartable in case something goes wrong. My plan is to catch exceptions and write the list of remaining URLs to a CSV file. Where exactly can I add this functionality?

Upvotes: 0

Views: 59

Answers (1)

Sebastián Palma

Reputation: 33420

You could store the URL on which the problem occurred and then pass it to a new scrapy.Request with the same parse callback to continue.

You can check whether the site printed an error message by inspecting the response body: if something bad has happened, yield a new scrapy.Request; if not, continue as normal.

Maybe:

def parse(self, response):
    current_url = response.request.url
    # response.text is the decoded body; response.body is raw bytes
    if 'Some or none message in the body' in response.text:
        # dont_filter=True is needed here, otherwise the dupefilter
        # silently drops a repeated request to the same URL
        yield scrapy.Request(current_url, callback=self.parse, dont_filter=True)
    else:
        item = ContentsPageSFBItem()
        content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()

        for content_item in content_items:
            item['content_item'] = content_item
            item['full_url'] = response.url
            item['title'] = response.xpath('//title[1]/text()').extract()
            yield item

Note that how you reuse the parse function depends heavily on which "exception" you want to catch.
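For example, network-level failures (a timeout, a DNS error, a non-2xx status) never reach parse at all; those can be caught with the errback argument of scrapy.Request instead. A minimal sketch, assuming you simply want to record each failed URL so a later run can retry it (the filename failed_urls.csv and the on_error name are illustrative, not from the original answer):

import scrapy

class RestartableSpider(scrapy.Spider):
    name = "contentspage_sfb"
    start_urls = [
        'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/',
    ]

    def start_requests(self):
        for url in self.start_urls:
            # errback fires on network errors and non-2xx responses
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        pass  # the normal extraction logic goes here

    def on_error(self, failure):
        # failure.request is the request that failed; append its URL
        # to a CSV so a later run can pick it up and retry
        with open('failed_urls.csv', 'a') as f:
            f.write(failure.request.url + '\n')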

Keeping in mind that you want to write the data to different files depending on the URL you're on, I've tweaked the code a little:

First, create three global variables to store the first and second URL, plus the fields as an array. Note that this works for these 2 URLs, but becomes hard to maintain if the list starts growing:

global first_url, second_url, fields
fields = []
first_url = 'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/'
second_url = 'https://www.safaribooksonline.com/library/view/cisa-certified-information/9780134677453/'
start_urls = [first_url, second_url]
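As an aside, a sketch of a lighter alternative (not part of the original answer): keeping these as plain class attributes on the spider avoids the global statement entirely:

import scrapy

class BasicSpider(scrapy.Spider):
    name = "contentspage_sfb"
    # class attributes are reachable from any method via self.first_url etc.
    first_url = 'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/'
    second_url = 'https://www.safaribooksonline.com/library/view/cisa-certified-information/9780134677453/'
    start_urls = [first_url, second_url]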

Then, within your parse function, you collect the data into the fields array and pass it to a second function, parse_and_write_csv, which creates and appends to a file chosen by the current URL.

def parse(self, response):
    item = ContentsPageSFBItem()
    content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()

    for content_item in content_items:

        item['content_item'] = content_item
        item['full_url'] = response.url
        item['title'] = response.xpath('//title[1]/text()').extract()

        # on Python 2 you would .encode('utf-8') the content item here;
        # on Python 3 the strings can be written as-is
        fields = [item['content_item'], item['full_url'], item['title'][0]]

        self.parse_and_write_csv(response, fields)

parse_and_write_csv takes the fields and, from the URL split on '/', picks the element at index 5 (the book's slug) to name the CSV file, creating it or appending to it if it already exists.

def parse_and_write_csv(self, response, fields):
    # index 5 of the split URL is the book slug, e.g. 'shell-programming-in'
    with open("%s.csv" % response.request.url.split('/')[5], 'a+') as f:
        f.write("{}\n".format(';'.join(str(field) for field in fields)))
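Since a chapter title can itself contain a semicolon, a more robust variant (a sketch, not from the original answer) lets Python's csv module handle the quoting:

import csv

def parse_and_write_csv(self, response, fields):
    # csv.writer quotes any field that contains the delimiter
    with open("%s.csv" % response.request.url.split('/')[5], 'a+', newline='') as f:
        csv.writer(f, delimiter=';').writerow(fields)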

Hope it helps. You can see a gist here.

Upvotes: 1
