Reputation: 906
I have tried three different variations of LinkExtractor, but it still ignores the 'deny' rule and crawls sub-domains in all three variations. I want to EXCLUDE the sub-domains from the crawl.
Tried with the 'allow' rule only, to allow only the main domain, i.e. example.edu.uk:
rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not working
Tried with the 'deny' rule only, to deny all sub-domains, i.e. sub.example.edu.uk:
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working
Tried with both the 'allow' and 'deny' rules:
rules = [Rule(LinkExtractor(allow=(r'^http:\/\/example\.edu\.uk(\/.*)?$'), deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working
Example:
Follow these links
Discard sub-domain links
Here is the complete code ...
from scrapy.item import Item, Field
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

class NewsFields(Item):
    pagetype = Field()
    pagetitle = Field()
    pageurl = Field()
    pagedate = Field()
    pagedescription = Field()
    bodytext = Field()

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['example.edu.uk']
    rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # Page metadata is stored in <meta name="nkd..."> tags
        ptype = soup.find_all(attrs={"name": "nkdpagetype"})
        ptitle = soup.find_all(attrs={"name": "nkdpagetitle"})
        pturl = soup.find_all(attrs={"name": "nkdpageurl"})
        ptdate = soup.find_all(attrs={"name": "nkdpagedate"})
        ptdesc = soup.find_all(attrs={"name": "nkdpagedescription"})
        for node in soup.find_all("div", id="main-content__wrapper"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        for url in hxs.xpath('//p/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)
Can someone please help? Thanks.
Upvotes: 1
Views: 3527
Reputation: 146510
Your first two rules are wrong:
rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not working
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working
The allow and deny patterns are matched against absolute URLs, not domains. The rule below should work for you:
rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )
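If you want to be explicit about rejecting sub-domains as well, both patterns have to match the absolute URL. A minimal sketch, assuming the pages live directly under http(s)://example.edu.uk (the sub-domain pattern is only illustrative):
# Allow only URLs on the bare domain; deny any URL that has an extra label
# in front of it. Both patterns match against the absolute URL of each link.
rules = (
    Rule(
        LinkExtractor(
            allow=(r'^https?://example\.edu\.uk/.*',),
            deny=(r'^https?://[a-z0-9-]+\.example\.edu\.uk/.*',),
        )
    ),
)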
Edit-1
First, you should change
allowed_domains = ['example.edu.uk']
to
allowed_domains = ['www.example.edu.uk']
Second, your rule for extracting URLs should be
rules = (Rule(LinkExtractor(allow=(r'^https?://www.example.edu.uk/.*', ))), )
Third, in your code below:
for url in hxs.xpath('//p/a/@href').extract():
    yield Request(response.urljoin(url), callback=self.parse)
the rules will not be applied. Your manual yields are not subject to the rules: a Rule inserts new requests automatically, but it does not prevent you from yielding other links that the rule config would not allow. Setting allowed_domains, however, applies to both the rule-generated requests and your own yields.
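Putting the three points together, here is a minimal sketch of how the pieces interact. The callback name parse_page is illustrative (CrawlSpider uses parse() internally to drive its rules, so the item callback gets a different name), and the XPath is just the one from your own code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']

    # Applies to every request: those generated by the rules and those
    # yielded manually from a callback.
    allowed_domains = ['www.example.edu.uk']

    # Only controls which extracted links are followed automatically.
    rules = (
        Rule(LinkExtractor(allow=(r'^https?://www\.example\.edu\.uk/.*',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # ... build and yield your items here ...
        # These manual yields bypass the rules, but the offsite middleware
        # still drops any request whose domain is not in allowed_domains.
        for url in response.xpath('//p/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse_page)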
Upvotes: 2