AdamF

Reputation: 619

scrapy: restricting link extraction to the request domain

I have a scrapy project which uses a list of URLs from different domains as the seeds, but for any given page, I only want to follow links in the same domain as that page's URL (so the usual LinkExtractor(allow_domains='example.com') approach wouldn't work). I'm surprised I couldn't find a solution on the web, as I'd expect this to be a common task. The best I could come up with was to define this in the spider file and refer to it in the rules:

import tldextract
from scrapy.linkextractors import LinkExtractor


class CustomLinkExtractor(LinkExtractor):

    def get_domain(self, url):
        # registered domain plus suffix, e.g. 'example.com'
        # https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url
        return '.'.join(tldextract.extract(url)[1:])


    def extract_links(self, response):
        # keep only links whose domain matches the domain of the page being parsed
        domain = self.get_domain(response.url)
        # https://stackoverflow.com/questions/40701227/using-scrapy-linkextractor-to-locate-specific-domain-extensions
        return list(
            filter(
                lambda link: self.get_domain(link.url) == domain,
                super(CustomLinkExtractor, self).extract_links(response)
            )
        )
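
For reference, the rules refer to the extractor roughly like this (same callback as in the later attempt below):

    rules = (
        Rule(CustomLinkExtractor(),
             callback='parse_response',
             follow=True,
             ),
    )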

But that doesn't work (the spider goes off-domain).

Now I'm trying to use the process_request option in the Rule:

    rules = (
        Rule(LinkExtractor(deny_domains='twitter.com'),
             callback='parse_response',
             process_request='check_r_r_domains',
             follow=True,
             ),
    )

and

    def check_r_r_domains(request, response):
        domain0 = '.'.join(tldextract.extract(request.url)[1:])
        domain1 = '.'.join(tldextract.extract(response.url)[1:])
        log('TEST:', domain0, domain1)
        if (domain0 == domain1) and (domain0 != 'twitter.com'):
            return request
        log(domain0, ' != ', domain1)
        return None

but I get an exception because self is being passed to the method as the request argument (hence the complaint that the spider has no url attribute); when I add self to the method signature, I get an exception that the response positional argument is missing! And if I instead set process_request=self.check_r_r_domains, I get an error because self isn't defined at the point where the rules are set!

Upvotes: 1

Views: 410

Answers (2)

AdamF

Reputation: 619

Oops, it turns out that conda on the server I'm using had installed Scrapy 1.6. I've forced it to install 1.8.0 from conda-forge and I think it's working now.
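
For reference, the command was something along these lines:

    conda install -c conda-forge scrapy=1.8.0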

Upvotes: 0

Gallaecio

Reputation: 3847

If you are using Scrapy 1.7.0 or later, you can pass Rule a callable as process_request. It receives both the request and the response the link was extracted from, so you can compare their URLs and drop the request (return None) when the domains do not match.
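
Something along these lines should work (spider, method, and callback names here are only illustrative, and the tldextract-based domain check mirrors the one in the question):

import tldextract
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SameDomainSpider(CrawlSpider):
    name = 'same_domain'
    start_urls = ['https://example.com/']

    rules = (
        Rule(LinkExtractor(),
             callback='parse_response',
             process_request='filter_offsite',  # spider method, referenced by name
             follow=True,
             ),
    )

    def filter_offsite(self, request, response):
        # Scrapy 1.7.0+ calls this with the request *and* the response it was extracted from
        domain0 = '.'.join(tldextract.extract(request.url)[1:])
        domain1 = '.'.join(tldextract.extract(response.url)[1:])
        if domain0 == domain1:
            return request
        return None  # dropping the request keeps the crawl on-domain

    def parse_response(self, response):
        yield {'url': response.url}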

Upvotes: 1
