Reputation: 57391
The following spider with fixed start_urls works:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NumberOfPagesSpider(CrawlSpider):
    name = "number_of_pages"
    allowed_domains = ["funda.nl"]

    # def __init__(self, place='amsterdam'):
    #     self.start_urls = ["http://www.funda.nl/koop/%s/" % place]

    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])

    rules = (Rule(le_maxpage, callback='get_max_page_number'),)

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0  # Initialize the maximum page number
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):  # Select only pages at a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))  # e.g. get 10 out of 'http://www.funda.nl/koop/amsterdam/p10/'
                if page_number > max_page_number:
                    max_page_number = page_number  # Keep the largest page number seen so far
        filename = "max_pages.txt"  # Output file for the result
        with open(filename, 'wb') as f:
            f.write('max_page_number = %s' % max_page_number)  # Write the maximum page number to a text file
If I run it with scrapy crawl number_of_pages, it writes a .txt file as expected. However, if I uncomment the def __init__ lines, comment out the start_urls = line, and try to run it with a user-defined input argument,

scrapy crawl number_of_pages -a place=amsterdam
I get the following error:
le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])
NameError: name 'start_urls' is not defined
So according to the spider, start_urls is not defined, even though the code fully determines it in the initializer. How can I get this spider to work with start_urls defined by an input argument?
Upvotes: 0
Views: 986
Reputation: 57391
Following masnun's answer, I managed to fix this. I list the updated code below for the sake of completeness.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NumberOfPagesSpider(CrawlSpider):
    name = "number_of_pages"
    allowed_domains = ["funda.nl"]

    def __init__(self, place='amsterdam'):
        self.start_urls = ["http://www.funda.nl/koop/%s/" % place]
        self.le_maxpage = LinkExtractor(allow=r'%s+p\d+' % self.start_urls[0])
        rules = (Rule(self.le_maxpage),)

    def parse(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0  # Initialize the maximum page number
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):  # Select only pages at a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))  # e.g. get 10 out of 'http://www.funda.nl/koop/amsterdam/p10/'
                if page_number > max_page_number:
                    max_page_number = page_number  # Keep the largest page number seen so far
        filename = "max_pages.txt"  # Output file for the result
        with open(filename, 'wb') as f:
            f.write('max_page_number = %s' % max_page_number)  # Write the maximum page number to a text file
Note that the Rule does not even need a callback, because parse is always called.
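For example, the spider can now be run for any place (rotterdam here is just an illustrative value, assuming funda.nl exposes the same p<N> pagination for other cities):

scrapy crawl number_of_pages -a place=rotterdam

after which max_pages.txt contains a single line of the form max_page_number = <N>, where <N> depends on the live site.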
Upvotes: 2
Reputation: 11906
Your le_maxpage is a class-level variable. When you pass the argument to __init__, you're creating an instance-level variable start_urls.

You used start_urls in le_maxpage, so for the le_maxpage expression to work, there would need to be a class-level variable named start_urls: the class body is executed once, when the class is defined, long before __init__ runs for any instance.

To fix this issue, you need to move your class-level variables to the instance level, that is, define them inside the __init__ block.
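A minimal sketch of that pattern, using the names from the question (the super().__init__ call and the *args/**kwargs forwarding are my additions, not part of the original code):

from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor

class NumberOfPagesSpider(CrawlSpider):
    name = "number_of_pages"
    allowed_domains = ["funda.nl"]

    def __init__(self, place='amsterdam', *args, **kwargs):
        super(NumberOfPagesSpider, self).__init__(*args, **kwargs)
        # Both attributes now live on the instance, so the second
        # assignment can safely refer to the first one.
        self.start_urls = ["http://www.funda.nl/koop/%s/" % place]
        self.le_maxpage = LinkExtractor(allow=r'%s+p\d+' % self.start_urls[0])

Both assignments run at instance-creation time, so nothing in the class body refers to a name that does not exist yet.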
Upvotes: 2