Reputation: 619
I have a scrapy project which uses a list of URLs from different domains as the seeds, but for any given page, I only want to follow links in the same domain as that page's URL (so the usual LinkExtractor(allow='example.com') approach wouldn't work). I'm surprised I couldn't find a solution on the web, as I'd expect this to be a common task. The best I could come up with was to define this in the spider file and refer to it in the Rules:
import tldextract
from scrapy.linkextractors import LinkExtractor


class CustomLinkExtractor(LinkExtractor):

    def get_domain(self, url):
        # https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url
        return '.'.join(tldextract.extract(url)[1:])

    def extract_links(self, response):
        domain = self.get_domain(response.url)
        # https://stackoverflow.com/questions/40701227/using-scrapy-linkextractor-to-locate-specific-domain-extensions
        return list(
            filter(
                lambda link: self.get_domain(link.url) == domain,
                super(CustomLinkExtractor, self).extract_links(response)
            )
        )
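For completeness, this is roughly how I refer to it in the Rules of my CrawlSpider subclass (the callback name is just illustrative):

rules = (
    Rule(CustomLinkExtractor(), callback='parse_response', follow=True),
)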
But that doesn't work (the spider goes off-domain).
Now I'm trying to use the process_request option in the Rule:
rules = (
    Rule(LinkExtractor(deny_domains='twitter.com'),
         callback='parse_response',
         process_request='check_r_r_domains',
         follow=True,
         ),
)
and
def check_r_r_domains(request, response):
    domain0 = '.'.join(tldextract.extract(request.url)[1:])
    domain1 = '.'.join(tldextract.extract(response.url)[1:])
    log('TEST:', domain0, domain1)
    if (domain0 == domain1) and (domain0 != 'twitter.com'):
        return request
    log(domain0, ' != ', domain1)
    return None
but I get an exception because it's passing self to the method (the spider has no url attribute); when I add self to the method signature, I get an exception that the response positional argument is missing! If I instead set process_request=self.check_r_r_domains, I get an error because self isn't defined at the point where I set the rules!
Upvotes: 1
Views: 410
Reputation: 619
Oops, it turns out that conda on the server I'm using had installed a 1.6 version of scrapy. I've forced it to install 1.8.0 from conda-forge and I think it's working now.
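In case it helps anyone else, a quick sanity check of which version the environment actually loads (nothing Scrapy-specific is assumed beyond the __version__ attribute):

import scrapy
print(scrapy.__version__)  # should report 1.8.0 rather than 1.6.x after the conda-forge install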
Upvotes: 0
Reputation: 3847
If you are using Scrapy 1.7.0 or later, you can pass Rule a process_request callable to check the URLs of both the request and the response, and drop the request (return None) if the domains do not match.
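A minimal sketch of what that can look like, reusing a tldextract-based domain check like the one in the question (the spider name, method names, callback and seed URLs here are just placeholders):

import tldextract
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SameDomainSpider(CrawlSpider):
    name = 'same_domain'
    start_urls = ['https://example.com/', 'https://example.org/']

    rules = (
        Rule(LinkExtractor(),
             callback='parse_response',
             process_request='keep_same_domain',  # resolved to the spider method below
             follow=True,
             ),
    )

    def get_domain(self, url):
        # registered domain, e.g. 'https://sub.example.com/page' -> 'example.com'
        return tldextract.extract(url).registered_domain

    def keep_same_domain(self, request, response):
        # Scrapy >= 1.7 passes the response the link was extracted from as the
        # second argument; returning None drops the request.
        if self.get_domain(request.url) == self.get_domain(response.url):
            return request
        return None

    def parse_response(self, response):
        yield {'url': response.url}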
Upvotes: 1