Slyper

Reputation: 906

Scrapy: LinkExtractor rule not working

I have tried 3 different variations of LinkExtractor, but it's still ignoring the 'deny' rule and crawling sub-domains in all 3 variations. I want to EXCLUDE the sub-domains from the crawl.

Tried with the 'allow' rule only, to allow only the main domain, i.e. example.edu.uk:

rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not Working

Tried with the 'deny' rule only, to deny all sub-domains, i.e. sub.example.edu.uk:

rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not Working

Tried with both the 'allow' and 'deny' rules:

rules = [Rule(LinkExtractor(allow=(r'^http:\/\/example\.edu\.uk(\/.*)?$'),deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not Working

Example: follow links on the main domain (example.edu.uk) and discard links to sub-domains (e.g. sub.example.edu.uk).

Here is the complete code:

from bs4 import BeautifulSoup
from scrapy import Field, Item, Request, Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NewsFields(Item):
    pagetype = Field()
    pagetitle = Field()
    pageurl = Field()
    pagedate = Field()
    pagedescription = Field()
    bodytext = Field()

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['example.edu.uk']
    rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        ptype = soup.find_all(attrs={"name": "nkdpagetype"})
        ptitle = soup.find_all(attrs={"name": "nkdpagetitle"})
        pturl = soup.find_all(attrs={"name": "nkdpageurl"})
        ptdate = soup.find_all(attrs={"name": "nkdpagedate"})
        ptdesc = soup.find_all(attrs={"name": "nkdpagedescription"})
        for node in soup.find_all("div", id="main-content__wrapper"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        for url in hxs.xpath('//p/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)

Can someone please help? Thanks

Upvotes: 1

Views: 3527

Answers (1)

Tarun Lalwani

Reputation: 146510

Your first two rules are wrong:

rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not Working
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not Working

The allow and deny patterns are matched against absolute URLs, not against domains. The rule below should work for you:

rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )
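A quick sanity check with plain re (no Scrapy needed; the sample URL is made up for illustration) shows why: the domain-anchored pattern never matches an absolute URL, while a scheme-aware pattern does.

import re

sample = 'http://www.example.edu.uk/listing/some-page'

# The original 'allow' pattern is anchored at the domain, so it can never
# match an absolute URL, which starts with the scheme.
print(bool(re.search(r'^example\.edu.uk(\/.*)?$', sample)))            # False

# A pattern that includes the scheme and host does match.
print(bool(re.search(r'^https?://www\.example\.edu\.uk/.*', sample)))  # True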

Edit-1

First, you should change this:

allowed_domains = ['example.edu.uk']

to

allowed_domains = ['www.example.edu.uk']
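The reason is that a bare domain entry like example.edu.uk also admits every sub-domain, while www.example.edu.uk pins the crawl to that one host. A rough stand-in for the offsite check (a simplified sketch, not Scrapy's actual code) illustrates the difference:

from urllib.parse import urlparse

def is_offsite(url, allowed):
    # A host is on-site if it equals an allowed domain or is a sub-domain of one.
    host = urlparse(url).netloc
    return not any(host == d or host.endswith('.' + d) for d in allowed)

print(is_offsite('http://sub.example.edu.uk/page', ['example.edu.uk']))      # False - still crawled
print(is_offsite('http://sub.example.edu.uk/page', ['www.example.edu.uk']))  # True  - filtered out
print(is_offsite('http://www.example.edu.uk/page', ['www.example.edu.uk']))  # False - crawled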

Second, your rule for extracting URLs should be:

rules = (Rule(LinkExtractor(allow=(r'^https?://www.example.edu.uk/.*', ))), )
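If you want to see which links that extractor would actually pick up before running a full crawl, you can feed it a hand-built response (the HTML and URLs below are made up for illustration):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

body = b"""
<p><a href="http://www.example.edu.uk/listing/profile-1">keep</a></p>
<p><a href="http://sub.example.edu.uk/other">drop</a></p>
"""
response = HtmlResponse(url='http://www.example.edu.uk/listing',
                        body=body, encoding='utf-8')

extractor = LinkExtractor(allow=(r'^https?://www.example.edu.uk/.*',))
for link in extractor.extract_links(response):
    print(link.url)  # only the www.example.edu.uk link is printed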

Third, in the code below:

for url in hxs.xpath('//p/a/@href').extract():
    yield Request(response.urljoin(url), callback=self.parse)

the rules will not be applied. Requests you yield yourself are not subject to the rules: a Rule inserts new requests automatically, but it will not prevent you from yielding other links that the rule config does not allow. Setting allowed_domains, however, applies to both the rules and your own yields.
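If you want those manual yields in parse() to stay on the same host anyway, you can do the check yourself before yielding; something along these lines would work (the helper name and its default host are made up for illustration):

from urllib.parse import urlparse
from scrapy import Request

def yield_onsite_links(response, callback, allowed_host='www.example.edu.uk'):
    # Build Requests only for links whose host matches allowed_host,
    # since Rule patterns are never applied to hand-made Requests.
    for url in response.xpath('//p/a/@href').extract():
        absolute = response.urljoin(url)
        if urlparse(absolute).netloc == allowed_host:
            yield Request(absolute, callback=callback)

Inside parse() the plain loop then becomes yield from yield_onsite_links(response, self.parse).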

Upvotes: 2
