Kurt Peek

Reputation: 57391

In scrapy, 'start_urls' not defined when passed as an input argument

The following spider with fixed start_urls works:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NumberOfPagesSpider(CrawlSpider):
    name = "number_of_pages"
    allowed_domains = ["funda.nl"]

    # def __init__(self, place='amsterdam'):
    #     self.start_urls = ["http://www.funda.nl/koop/%s/" % place]

    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])

    rules = (Rule(le_maxpage, callback='get_max_page_number'),)

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0                                                 # Initialize the maximum page number
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):         # Select only pages with a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))       # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                if page_number > max_page_number:
                    max_page_number = page_number                           # Update the maximum page number if the current value is larger than its previous value
        filename = "max_pages.txt"                                          # Output file name (fixed; the place name is not included)
        with open(filename,'w') as f:                                       # Text mode, because a str is written
            f.write('max_page_number = %s' % max_page_number)               # Write the maximum page number to a text file

If I run it with scrapy crawl number_of_pages, it writes a .txt file as expected. However, if I uncomment the def __init__ lines, comment out the start_urls = line, and try to run it with a user-defined input argument,

scrapy crawl number_of_pages -a place=amsterdam

I get the following error:

le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])
NameError: name 'start_urls' is not defined

So according to the spider, start_urls is not defined, even though it is assigned in __init__. How can I get this spider to work with start_urls defined by an input argument?

Upvotes: 0

Views: 986

Answers (2)

Kurt Peek

Reputation: 57391

Following masnun's answer, I managed to fix this. I list the updated code below for the sake of completeness.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NumberOfPagesSpider(CrawlSpider):
    name = "number_of_pages"
    allowed_domains = ["funda.nl"]

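    def __init__(self, place='amsterdam', *args, **kwargs):
        self.start_urls = ["http://www.funda.nl/koop/%s/" % place]
        self.le_maxpage = LinkExtractor(allow=r'%s+p\d+' % self.start_urls[0])
        self.rules = (Rule(self.le_maxpage, ),)                             # Assign to self; a bare 'rules' would be a discarded local variable
        super(NumberOfPagesSpider, self).__init__(*args, **kwargs)          # Let CrawlSpider finish its own setup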

    def parse(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0                                                 # Initialize the maximum page number
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):         # Select only pages with a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))       # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                if page_number > max_page_number:
                    max_page_number = page_number                           # Update the maximum page number if the current value is larger than its previous value
        filename = "max_pages.txt"                                          # Output file name (fixed; the place name is not included)
        with open(filename,'w') as f:                                       # Text mode, because a str is written
            f.write('max_page_number = %s' % max_page_number)               # Write the maximum page number to a text file

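Note that the Rule has no callback here: the response to start_urls is handled by parse. Be aware that the Scrapy documentation advises against overriding parse in a CrawlSpider, because CrawlSpider uses that method to apply its rules, so the rule above is probably not being followed at all.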
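
The page-number logic in parse can be checked without running a crawl. What follows is only a minimal sketch, assuming the URL shape given in the code comments (http://www.funda.nl/koop/amsterdam/p10/); the function name max_page_number and the sample_urls list are hypothetical and not part of the spider.

```python
def max_page_number(urls):
    """Return the largest N among URLs shaped like .../koop/<place>/pN/ (0 if none match)."""
    best = 0
    for url in urls:
        # 'http://www.funda.nl/koop/amsterdam/p10/' contains exactly six slashes
        if url.count('/') == 6 and url.endswith('/'):
            segment = url.split('/')[-2]                  # e.g. 'p10'
            # Stricter than strip('p'): skip segments that are not 'p' plus digits
            if segment.startswith('p') and segment[1:].isdigit():
                best = max(best, int(segment[1:]))
    return best


# Hypothetical sample data, for illustration only
sample_urls = [
    "http://www.funda.nl/koop/amsterdam/p2/",
    "http://www.funda.nl/koop/amsterdam/p10/",
    "http://www.funda.nl/koop/amsterdam/p3/extra/",       # deeper link, ignored
]
result = max_page_number(sample_urls)
```

Inside the spider, the same function could be fed [link.url for link in links], which keeps the parsing testable on its own.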

Upvotes: 2

masnun

Reputation: 11906

Your le_maxpage is a class-level variable. Its value is computed when the class body is executed, which happens before any instance is created. When you pass the argument to __init__, you are creating an instance-level variable, self.start_urls, and only once the spider is instantiated.

Because le_maxpage uses start_urls, it only works if a class-level variable named start_urls already exists at that point.

To fix this, move the class-level variables to instance level, that is, define them inside __init__.
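
The class-level versus instance-level distinction can be reproduced without Scrapy. This is a minimal sketch in plain Python; the class names Broken and Fixed are hypothetical stand-ins for the spider, and a plain string pattern stands in for the LinkExtractor.

```python
# A class body runs once, when the class is defined, before any __init__ call.
try:
    class Broken(object):
        def __init__(self, place='amsterdam'):
            self.start_urls = ["http://www.funda.nl/koop/%s/" % place]

        # Evaluated while the class is being built: no name 'start_urls' exists yet
        pattern = r'%s+p\d+' % start_urls[0]
except NameError as exc:
    error_message = str(exc)


class Fixed(object):
    def __init__(self, place='amsterdam'):
        # Both values are created per instance, after the argument is known
        self.start_urls = ["http://www.funda.nl/koop/%s/" % place]
        self.pattern = r'%s+p\d+' % self.start_urls[0]


fixed = Fixed(place='rotterdam')
```

Defining Broken fails with the same NameError as in the question, while Fixed builds its pattern from whatever place is passed in.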

Upvotes: 2
