Reputation: 1105
I have a class running some code before the init:
class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams
I am running this scrapy code with the following command:
> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt
Now, I want the static variable named rules to be configurable from the command-line:
> scrapy runspider my_spider.py -a crawl_pages=True -a moreparams="more parameters" -o output.txt
changing the init to:
def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
    if crawl_pages is True:
        self.rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )
    self.moreparams = moreparams
However, if I set the rules variable from within the init, Scrapy no longer takes it into account: the spider runs, but only crawls the given start_urls, not the whole domain. It seems that rules must be a class-level attribute.
So, how can I dynamically set that class-level attribute?
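The failure can be reproduced outside Scrapy. A minimal sketch (plain Python, not Scrapy; Base stands in for CrawlSpider, whose __init__ compiles self.rules exactly once at construction time):

```python
class Base:
    rules = ()

    def __init__(self):
        # Mimics CrawlSpider, which "compiles" self.rules once, at construction time.
        self._compiled = list(self.rules)


class Child(Base):
    def __init__(self, crawl_pages=False):
        super().__init__()          # rules are compiled here...
        if crawl_pages:
            self.rules = ("rule",)  # ...so this later instance-level assignment is never compiled


assert Child(crawl_pages=True)._compiled == []
```

The assignment happens after the parent has already read and processed the attribute, so it has no effect on crawling behavior.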
Upvotes: 5
Views: 2826
Reputation: 351
class NoFollowSpider(CrawlSpider):
    def __init__(self, crawl_pages=False, moreparams=None, *a, **kw):
        if crawl_pages is True:
            NoFollowSpider.rules = (
                Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
            )
        # No need to call "_compile_rules()" manually; it's called in the parent's __init__
        super(NoFollowSpider, self).__init__(*a, **kw)
        # Keep going as before
        self.moreparams = moreparams
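The key point in this answer is ordering: the class attribute is assigned before the parent's __init__ runs, so the parent compiles the new rules. A plain-Python sketch of that ordering (Base stands in for CrawlSpider):

```python
class Base:
    rules = ()

    def __init__(self):
        # The parent compiles rules here.
        self._compiled = list(self.rules)


class Child(Base):
    def __init__(self, crawl_pages=False):
        if crawl_pages:
            Child.rules = ("follow-everything",)  # set on the class, BEFORE calling the parent
        super().__init__()


assert Child(crawl_pages=True)._compiled == ["follow-everything"]
```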
Upvotes: 1
Reputation: 350
I'm doing this with Scrapy 1.0, and it works. Notice that you can only trust kwargs on the initial Spider instantiation.
class LinuxFoundationSpider(CrawlSpider):
    year = None

    def __init__(self, category=None, *args, **kwargs):
        monthly_thread_xpath = 'date\.html'
        if kwargs.get('year'):
            LinuxFoundationSpider.year = kwargs['year']
        if LinuxFoundationSpider.year:
            monthly_thread_xpath = '%s.*?(\\/date\\.html)' % LinuxFoundationSpider.year
        LinuxFoundationSpider.rules = (
            Rule(LinkExtractor(allow=(monthly_thread_xpath,))),
            Rule(LinkExtractor(restrict_xpaths=('//ul[2]/li/a[1]',)),
                 callback='parse_entry', follow=False),
        )
        super(LinuxFoundationSpider, self).__init__(*args, **kwargs)
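The year-specific pattern built above can be checked in isolation. A quick sketch (the archive URL below is made up for illustration):

```python
import re

year = "2015"
# Same pattern the spider builds when a year is supplied.
monthly_thread_xpath = '%s.*?(\\/date\\.html)' % year

# Hypothetical pipermail-style archive URL.
url = "https://lists.linuxfoundation.org/pipermail/somelist/2015-March/date.html"
assert re.search(monthly_thread_xpath, url)

# A different year's page does not match.
assert not re.search(monthly_thread_xpath,
                     "https://lists.linuxfoundation.org/pipermail/somelist/2014-March/date.html")
```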
Upvotes: 0
Reputation: 45241
How can I dynamically set a static variable?
I don't know scrapy, but is there any reason you can't just use a class method?
class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )

    @classmethod
    def set_rules(klass, rules):
        klass.rules = rules
Note that rules isn't a static variable; it's a class attribute.
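The classmethod approach can be demonstrated without Scrapy at all; a minimal sketch of a class attribute shared by all instances:

```python
class Spider:
    rules = ("default",)

    @classmethod
    def set_rules(klass, rules):
        klass.rules = rules


Spider.set_rules(("custom",))
assert Spider.rules == ("custom",)
assert Spider().rules == ("custom",)  # instances see the class attribute
```

The catch with Scrapy specifically is that set_rules would have to be called before the framework instantiates the spider, which is why the answers above hook into __init__ or __new__ instead.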
EDIT - Here's an alternate way to potentially set it at the very beginning. It should allow you to avoid having to call _compile_rules(), and I think it's a lot cleaner:
class NoFollowSpider(CrawlSpider):
    def __new__(klass, crawl_pages=False, moreparams=None, *args, **kwargs):
        if crawl_pages:
            klass.rules = (
                Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
            )
        # object.__new__ takes no extra arguments; pass only the class
        return super(NoFollowSpider, klass).__new__(klass)

    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams
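Why this works: __new__ runs before __init__, so the class attribute is already in place when the parent's __init__ compiles the rules. A plain-Python sketch of that sequence (Base stands in for CrawlSpider):

```python
class Base:
    rules = ()

    def __init__(self):
        self._compiled = list(self.rules)  # the parent "compiles" rules here


class Child(Base):
    def __new__(cls, crawl_pages=False):
        if crawl_pages:
            cls.rules = ("follow-everything",)  # runs before any __init__
        return super().__new__(cls)

    def __init__(self, crawl_pages=False):
        super().__init__()


assert Child(crawl_pages=True)._compiled == ["follow-everything"]
```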
Upvotes: 0
Reputation: 1105
So here is how I resolved the problem, with the great help of @Not_a_Golfer and @nramirezuy. I'm simply using a bit of both of their suggestions:
class NoFollowSpider(CrawlSpider):
    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        # Set the class member from here
        if crawl_pages is True:
            NoFollowSpider.rules = (
                Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
            )
            # Then recompile the rules
            super(NoFollowSpider, self)._compile_rules()
        # Keep going as before
        self.moreparams = moreparams
Thank you all for your help!
Upvotes: 10
Reputation: 49187
Well, you have two options. The simpler one (I'm not sure it will work) is to use the class instead of self in the constructor to set the rules:
def __init__(self, session_id=-1, crawl_pages=False, allowed_domains=None, start_urls=None,
             xpath=None, contains=None, doesnotcontain=None, *args, **kwargs):
    # You simply set the class member from here
    NoFollowSpider.rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )
I'm not sure whether Scrapy will respect that; it depends on when it reads those rules. But it's worth a try.
Another, more complicated method is using metaclasses. Basically, you can intervene in the way the class itself is created, not only its instances. Note that the metaclass' __new__ method runs at import time, when the class definition is executed, before any other code runs.
class MyType(type):
    """
    A metaclass that creates classes.
    """
    @staticmethod
    def __new__(cls, name, bases, dict):
        ret = type.__new__(cls, name, bases, dict)
        # Whatever you want to do - do it here. You can peek into
        # the command-line args, for example.
        ret.rules = (....)
        return ret


class MyClass(object):
    """
    Now comes the actual class, with the __metaclass__ identifier.
    This means that when we create the class definition, we call the metaclass' __new__.
    """
    __metaclass__ = MyType

    def __init__(self):
        pass
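A runnable variant of the same idea, using the Python 3 spelling (metaclass= keyword instead of the Python 2 __metaclass__ attribute); the rules value and the --crawl flag are illustrative placeholders:

```python
import sys


class RulesMeta(type):
    def __new__(mcls, name, bases, namespace):
        cls = type.__new__(mcls, name, bases, namespace)
        # Peek at the command line while the class object is being created.
        cls.rules = ("follow-everything",) if "--crawl" in sys.argv else ()
        return cls


class MySpider(metaclass=RulesMeta):  # Python 3 syntax for __metaclass__
    pass


# rules was decided at class-creation time, before any instance exists.
assert isinstance(MySpider.rules, tuple)
```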
Upvotes: 2