Reputation: 2721
I want to pass a parameter on the scrapy crawl ...
command line to be used in the rule definition of my extended CrawlSpider, like the following:
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com']

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php')
    # and follow links from them (since no callback means follow=True by default).
    Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's method parse_item
    Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
)
I want the allow attribute of the SgmlLinkExtractor to be specified via a command-line parameter.
I have googled and found that I can get the parameter value in the spider's __init__
method (a rough sketch of what I found is below), but how can I use a command-line parameter in the Rule definition?
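What I have so far looks roughly like this (just a sketch; the category argument name is only a placeholder):

from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'example.com'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # The value passed with -a on the command line arrives here, e.g.:
        #   scrapy crawl example.com -a category=electronics
        self.category = category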
Upvotes: 4
Views: 918
Reputation: 20748
You can build your spider's rules attribute in the __init__ method, something like:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # Build the rules from the -a argument before calling the parent
        # __init__, which compiles self.rules.
        self.rules = (
            Rule(SgmlLinkExtractor(allow=(allow,))),
        )
        super(MySpider, self).__init__(*args, **kwargs)
And you pass the allow attribute on the command line like this:
scrapy crawl example.com -a allow="item\.php"
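Note that SgmlLinkExtractor has been deprecated and later removed in newer Scrapy releases. The same pattern should work with the generic LinkExtractor; here is a rough sketch assuming a recent Scrapy version (parse_item is just a placeholder callback):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # self.rules must exist before CrawlSpider.__init__ compiles the rules.
        self.rules = (
            Rule(LinkExtractor(allow=(allow,)), callback='parse_item'),
        )
        super(MySpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        self.logger.info('Scraped %s', response.url)

The invocation is the same: scrapy crawl example.com -a allow="item\.php"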
Upvotes: 5