Antoine Brunel

Reputation: 1105

How to dynamically set Scrapy rules?

I have a class running some code before the init:

class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams

I am running this scrapy code with the following command:

> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt 

Now, I want the static variable named rules to be configurable from the command-line:

> scrapy runspider my_spider.py -a crawl_pages=True -a moreparams="more parameters" -o output.txt

changing the init to:

def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
    if (crawl_pages is True):
        self.rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )
    self.moreparams = moreparams

However, when I set rules inside __init__ instead of at class level, Scrapy no longer takes it into account: the spider runs, but it only crawls the given start_urls and not the whole domain. It seems that rules must be a static class variable.

So, how can I dynamically set a static variable?

Upvotes: 5

Views: 2826

Answers (6)

jayshilling

Reputation: 351

class NoFollowSpider(CrawlSpider):
    def __init__(self, crawl_pages=False, moreparams=None, *a, **kw):
        if (crawl_pages is True):
            NoFollowSpider.rules = ( Rule (SgmlLinkExtractor(allow=("", ),),
                                           callback="parse_items",  follow= True),)

        # No need to call "_compile_rules()" manually, it's called in __init__ of the parent
        super(NoFollowSpider, self).__init__(*a, **kw)

        # Keep going as before
        self.moreparams = moreparams

Upvotes: 1

kvnn

Reputation: 350

I'm doing this with Scrapy 1.0, and it works. Notice that you can only trust kwargs on the initial Spider instantiation.

    class LinuxFoundationSpider(CrawlSpider):
        year = None

        def __init__(self, category=None, *args, **kwargs):
            # Regex for the link extractor (despite the "xpath" in the name)
            monthly_thread_xpath = r'date\.html'
            if kwargs.get('year'):
                LinuxFoundationSpider.year = kwargs['year']
            if LinuxFoundationSpider.year:
                monthly_thread_xpath = r'%s.*?(\/date\.html)' % LinuxFoundationSpider.year

            # Set the rules on the class before the parent __init__ compiles them
            LinuxFoundationSpider.rules = (
                Rule(LinkExtractor(allow=(monthly_thread_xpath,))),
                Rule(LinkExtractor(restrict_xpaths=('//ul[2]/li/a[1]',)),
                     callback='parse_entry', follow=False),
            )
            super(LinuxFoundationSpider, self).__init__(*args, **kwargs)

Upvotes: 0

Rick

Reputation: 45241

How can I dynamically set a static variable?

I don't know scrapy, but is there any reason you can't just use a class method?

class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )

    @classmethod
    def set_rules(klass, rules):
        klass.rules = rules

Note that rules isn't a static variable, it's a class attribute.
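
The distinction matters for this question, so here is a minimal plain-Python sketch of it. Demo is a hypothetical class that has nothing to do with Scrapy; it only shows that assigning through an instance creates an attribute that shadows the class attribute, while assigning through the class changes what every other instance sees.

```python
class Demo(object):
    # Class attribute: shared by all instances that do not override it
    rules = ("default",)


a = Demo()
b = Demo()

# Assigning through the instance creates an instance attribute on `a` only
a.rules = ("only-a",)

# Rebinding on the class changes the value seen by instances
# that have no attribute of their own (here: `b`)
Demo.rules = ("shared",)

rules_seen_by_a = a.rules
rules_seen_by_b = b.rules
```

So setting NoFollowSpider.rules changes the rules for every spider instance created from that class in the same process, which may or may not be what you want.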


EDIT - Here's an alternate way to potentially set it at the very beginning. Should allow you to avoid having to do _compile_rules(), and I think it's a lot cleaner:

class NoFollowSpider(CrawlSpider):
    def __new__(klass, crawl_pages=False, moreparams=None, *args, **kwargs):
        if crawl_pages:
            klass.rules = (
                Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
            )
        return super(NoFollowSpider, klass).__new__(klass, *args, **kwargs)

    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams

Upvotes: 0

Antoine Brunel

Reputation: 1105

Here is how I resolved the problem, with the great help of @Not_a_Golfer and @nramirezuy. I simply combined a bit of what each of them suggested:

class NoFollowSpider(CrawlSpider):

    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        # Set the class member from here
        if (crawl_pages is True):
            NoFollowSpider.rules = (
                Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
            )
            # Then recompile the Rules
            super(NoFollowSpider, self)._compile_rules()

        # Keep going as before
        self.moreparams = moreparams
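
One caveat when running this from the command line: values passed with -a reach __init__ as strings, so a check like "crawl_pages is True" only holds when the spider is instantiated from Python with a real boolean. A minimal sketch of a conversion helper follows; to_bool and its list of accepted spellings are assumptions for illustration, not part of Scrapy.

```python
def to_bool(value, default=False):
    """Interpret a spider argument (usually a string) as a boolean.

    Hypothetical helper: the accepted spellings below are an assumption.
    """
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes", "on")


# What "-a crawl_pages=True" would hand to __init__
arg_from_cli = "True"

identity_check = arg_from_cli is True   # the string is not the bool object
converted = to_bool(arg_from_cli)       # explicit conversion instead
```

Inside __init__ the test would then read "if to_bool(crawl_pages):" instead of comparing against True.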

Thank you all for your help!

Upvotes: 10

nramirezuy

Reputation: 171

Rules are being compiled before you define them.
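
To illustrate the ordering, here is a minimal sketch that does not use Scrapy itself. FakeCrawlSpider is a hypothetical stand-in that, as far as I understand CrawlSpider, mimics its behavior of copying self.rules into self._rules inside the constructor; the subclass names and the string used as a rule placeholder are assumptions for illustration.

```python
import copy


class FakeCrawlSpider(object):
    """Simplified stand-in: snapshots `rules` once, in __init__."""
    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # Copy whatever `self.rules` resolves to at this moment
        self._rules = [copy.copy(rule) for rule in self.rules]


class SetAfter(FakeCrawlSpider):
    def __init__(self):
        super(SetAfter, self).__init__()       # snapshot taken here (empty)
        self.rules = ("follow-everything",)    # too late, never compiled


class SetBefore(FakeCrawlSpider):
    def __init__(self):
        self.rules = ("follow-everything",)    # defined first
        super(SetBefore, self).__init__()      # snapshot now sees the rule


class SetAfterAndRecompile(FakeCrawlSpider):
    def __init__(self):
        super(SetAfterAndRecompile, self).__init__()
        self.rules = ("follow-everything",)
        self._compile_rules()                  # take the snapshot again


after = SetAfter()._rules
before = SetBefore()._rules
recompiled = SetAfterAndRecompile()._rules
```

In this model, either defining the rules before calling the parent constructor or recompiling afterwards gets the rule into the compiled list; only the first variant silently loses it.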

Upvotes: 1

Not_a_Golfer

Reputation: 49187

Well, you have two options. The simpler one (I'm not sure it will work) is to set the rules on the class instead of on self in the constructor:

def __init__(self, session_id=-1, crawl_pages=False, allowed_domains=None, start_urls=None, xpath=None, contains = None, doesnotcontain=None, *args, **kwargs):

    #You simply set the class member from here
    NoFollowSpider.rules = ( Rule (SgmlLinkExtractor(allow=("", ),),
                callback="parse_items",  follow= True),)

I'm not sure if scrapy will respect that - it depends on when it reads those rules. But worth a try.

Another, more complicated method is to use metaclasses. Basically, you can intervene in the way the class itself is created, not only its instances. Note that the metaclass's __new__ method runs when the class definition is executed, i.e. at import time, before any spider is instantiated.

class MyType(type):
    """
    A Meta class that creates classes 
    """
    @staticmethod
    def __new__(cls, name, bases, dict):
        ret = type.__new__(cls, name, bases, dict)

        # whatever you want to do - do it here. You can peek into
        # the command line args for example
        ret.rules = (....)
        return ret


class MyClass(object):
    """
    Now comes the actual class, with the __metaclass__ identifier.
    This means that when we create the class definition we call the metaclass' __new__
    """ 
    __metaclass__ = MyType

    def __init__(self):
        pass
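
The example above uses the Python 2 spelling. On Python 3 a __metaclass__ attribute in the class body is ignored and the metaclass is passed as a keyword in the class statement instead. A minimal sketch of the same idea in that syntax; RulesMeta, MySpider, Legacy and the placeholder rule value are hypothetical names chosen for illustration.

```python
class RulesMeta(type):
    """Metaclass that attaches `rules` to every class it creates."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        # Runs once, when the class statement is executed
        cls.rules = ("set-by-metaclass",)
        return cls


class MySpider(metaclass=RulesMeta):
    """Python 3 syntax: the metaclass is applied."""


class Legacy(object):
    """Python 2 spelling: on Python 3 this is just an ordinary attribute."""
    __metaclass__ = RulesMeta


spider_rules = MySpider.rules
legacy_has_rules = hasattr(Legacy, "rules")
```

Because the metaclass runs at class creation, anything it reads (for example command-line arguments) has to be available at import time.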

Upvotes: 2
