David

Reputation: 2721

How to access a command-line parameter in a CrawlSpider in Scrapy?

I want to pass a parameter on the scrapy crawl ... command line and use it in the rule definition of my extended CrawlSpider, like the following:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        # ... extract and return items from the matched pages ...
        pass

I want the allow pattern of the SgmlLinkExtractor to come from a command-line parameter. I have googled and found that I can read the parameter's value in the spider's __init__ method (see the sketch below), but how can I get that value into the Rule definition?
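
Roughly what I found for reading the argument in __init__, with allow as an example name:

from scrapy.contrib.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'example.com'

    def __init__(self, allow=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # `scrapy crawl example.com -a allow=...` passes allow=...
        # to this constructor as a keyword argument.
        self.allow = allow

The value is available there, but the rules shown above are defined at class level, before __init__ runs.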

Upvotes: 4

Views: 918

Answers (1)

paul trmbrth

Reputation: 20748

You can build your Spider's rules attribute in the __init__ method, something like:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):

    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # Build the rules from the command-line argument before calling
        # the parent constructor, which compiles them.
        self.rules = (
            Rule(SgmlLinkExtractor(allow=(allow,))),
        )
        super(MySpider, self).__init__(*args, **kwargs)

And you pass the allow argument on the command line like this:

scrapy crawl example.com -a allow="item\.php"
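
If you are on a newer Scrapy release where SgmlLinkExtractor is deprecated, the same pattern should work with the generic LinkExtractor; here is a minimal sketch assuming the Scrapy 1.0+ import paths:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # Assign self.rules before calling the parent constructor,
        # which is where CrawlSpider compiles the rules.
        self.rules = (
            Rule(LinkExtractor(allow=(allow,))),
        )
        super(MySpider, self).__init__(*args, **kwargs)

The command line stays the same.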

Upvotes: 5
