Reputation: 95
The following is my scrapy code:
def get_host_regex(self, spider):
"""Override this method to implement a different offsite policy"""
allowed_domains = getattr(spider, 'allowed_domains', None)
if not allowed_domains:
return re.compile('') # allow all by default
regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
return re.compile(regex)
def spider_opened(self, spider):
self.host_regex = self.get_host_regex(spider)
self.domains_seen = set()
Because the allowed_domains is very big, it throws this exception:
regex = r'^(.*.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
How do I solve this problem?
Upvotes: 0
Views: 1237
Reputation: 20748
You can build your own OffsiteMiddleware
variation, with a different implementation checking requests to domains not in the spider's allowed_domains
.
For example, add this in a middlewares.py
file,
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached
class SimpleOffsiteMiddleware(OffsiteMiddleware):
def spider_opened(self, spider):
# don't build a regex, just use the list as-is
self.allowed_hosts = getattr(spider, 'allowed_domains', [])
self.domains_seen = set()
def should_follow(self, request, spider):
if self.allowed_hosts:
host = urlparse_cached(request).hostname or ''
# does 'www.example.com' end with 'example.com'?
# test this for all allowed domains
return any([host.endswith(h) for h in self.allowed_hosts])
else:
return True
and change your settings to disable the default OffsiteMiddleware, and add yours:
SPIDER_MIDDLEWARES = {
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
'myproject.middlewares.SimpleOffsiteMiddleware': 500,
}
Warning: this middleware is not tested. This is a very naive implementation, definitely not very efficient (testing string inclusion for each of 50'000 possible domains for each and every request). You could use another backend to store the list and test a hostname value, like sqlite for example.
Upvotes: 2