rowele

Reputation: 95

How to work with a very large "allowed_domains" attribute in scrapy?

The following is my scrapy code:

def get_host_regex(self, spider):
    """Override this method to implement a different offsite policy"""
    allowed_domains = getattr(spider, 'allowed_domains', None)
    if not allowed_domains:
        return re.compile('') # allow all by default
    regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
    return re.compile(regex)

def spider_opened(self, spider):
    self.host_regex = self.get_host_regex(spider)
    self.domains_seen = set()

Because allowed_domains is very large, an exception is thrown at this line:

regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)

How do I solve this problem?

Upvotes: 0

Views: 1237

Answers (1)

paul trmbrth

Reputation: 20748

You can build your own OffsiteMiddleware variation, with a different implementation for checking whether requests belong to the spider's allowed_domains.

For example, add this in a middlewares.py file:

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached


class SimpleOffsiteMiddleware(OffsiteMiddleware):

    def spider_opened(self, spider):
        # don't build a regex, just use the list as-is
        self.allowed_hosts = getattr(spider, 'allowed_domains', [])
        self.domains_seen = set()

    def should_follow(self, request, spider):
        if self.allowed_hosts:
            host = urlparse_cached(request).hostname or ''
            # does 'www.example.com' end with 'example.com'?
            # test this for all allowed domains
            return any(host.endswith(h) for h in self.allowed_hosts)
        else:
            return True

and change your settings to disable the default OffsiteMiddleware, and add yours:

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    'myproject.middlewares.SimpleOffsiteMiddleware': 500,
}

Warning: this middleware is not tested. It is a very naive implementation and definitely not efficient (it tests the hostname as a string suffix against each of the 50,000 possible domains for every single request). You could use another backend to store the list and test a hostname against it, such as sqlite for example.
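
For instance, an (equally untested) in-memory alternative that avoids scanning the whole list on every request could look like the sketch below. The class name SetLookupOffsiteMiddleware is only illustrative: it keeps the allowed domains in a set and tests the hostname plus each of its parent-domain suffixes, so each request costs as many lookups as the hostname has labels, rather than one comparison per allowed domain.

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached


class SetLookupOffsiteMiddleware(OffsiteMiddleware):

    def spider_opened(self, spider):
        # keep the allowed domains in a set for O(1) membership tests
        self.allowed_hosts = set(getattr(spider, 'allowed_domains', None) or [])
        self.domains_seen = set()

    def should_follow(self, request, spider):
        if not self.allowed_hosts:
            return True
        host = urlparse_cached(request).hostname or ''
        # test 'www.sub.example.com', 'sub.example.com', 'example.com', 'com'
        labels = host.split('.')
        suffixes = ('.'.join(labels[i:]) for i in range(len(labels)))
        return any(suffix in self.allowed_hosts for suffix in suffixes)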

Upvotes: 2
