Reputation: 906
I have tried three different variations of LinkExtractor, but it still ignores the 'deny' rule and crawls sub-domains in all three variations. I want to EXCLUDE the sub-domains from the crawl.
Tried with the 'allow' rule only, to allow only the main domain, i.e. example.edu.uk:
rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not working
Tried with the 'deny' rule only, to deny all sub-domains, i.e. sub.example.edu.uk:
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working
Tried with both the 'allow' and 'deny' rules:
rules = [Rule(LinkExtractor(allow=(r'^http:\/\/example\.edu\.uk(\/.*)?$'), deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working
Example:
Follow these links
Discard sub-domain links
Here is the complete code ...
from scrapy.item import Item, Field
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

class NewsFields(Item):
    pagetype = Field()
    pagetitle = Field()
    pageurl = Field()
    pagedate = Field()
    pagedescription = Field()
    bodytext = Field()

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['example.edu.uk']
    rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # Page metadata is stored in <meta name="nkd..."> tags
        ptype = soup.find_all(attrs={"name": "nkdpagetype"})
        ptitle = soup.find_all(attrs={"name": "nkdpagetitle"})
        pturl = soup.find_all(attrs={"name": "nkdpageurl"})
        ptdate = soup.find_all(attrs={"name": "nkdpagedate"})
        ptdesc = soup.find_all(attrs={"name": "nkdpagedescription"})
        for node in soup.find_all("div", id="main-content__wrapper"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        for url in hxs.xpath('//p/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)
Can someone please help? Thanks.
Upvotes: 1
Views: 3527
Reputation: 146510
Your first two rules are wrong:
rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not working
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working
The allow and deny patterns are matched against absolute URLs, not domains. The rule below should work for you:
rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )
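If you want to be explicit about rejecting sub-domains as well, both patterns have to match the absolute URL. A minimal sketch, assuming the pages live directly under http(s)://example.edu.uk (the sub-domain pattern is only illustrative):
# Allow only URLs on the bare domain; deny any URL that has an extra label
# in front of it. Both patterns match against the absolute URL of each link.
rules = (
    Rule(
        LinkExtractor(
            allow=(r'^https?://example\.edu\.uk/.*',),
            deny=(r'^https?://[a-z0-9-]+\.example\.edu\.uk/.*',),
        )
    ),
)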
Edit-1
First, you should change
allowed_domains = ['example.edu.uk']
to
allowed_domains = ['www.example.edu.uk']
Second, your rule for extracting URLs should be
rules = (Rule(LinkExtractor(allow=(r'^https?://www.example.edu.uk/.*', ))), )
Third, in your code below:
for url in hxs.xpath('//p/a/@href').extract():
    yield Request(response.urljoin(url), callback=self.parse)
the rules will not be applied. Your manual yields are not subject to the rules: a Rule inserts new requests automatically, but it does not prevent you from yielding other links that the rule config would not allow. Setting allowed_domains, however, applies to both the rule-generated requests and your own yields.
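Putting the three points together, here is a minimal sketch of how the pieces interact. The callback name parse_page is illustrative (CrawlSpider uses parse() internally to drive its rules, so the item callback gets a different name), and the XPath is just the one from your own code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']

    # Applies to every request: those generated by the rules and those
    # yielded manually from a callback.
    allowed_domains = ['www.example.edu.uk']

    # Only controls which extracted links are followed automatically.
    rules = (
        Rule(LinkExtractor(allow=(r'^https?://www\.example\.edu\.uk/.*',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # ... build and yield your items here ...
        # These manual yields bypass the rules, but the offsite middleware
        # still drops any request whose domain is not in allowed_domains.
        for url in response.xpath('//p/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse_page)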
Upvotes: 2