Reputation: 37
I want the loop to check each link - if it goes to an external domain to output it - at the moment it outputs all links (internal and external). What have I messed up? (For testing I've tweaked the code to just work from a single page and not crawl the rest of the site.)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re

class MySpider(CrawlSpider):
    name = 'crawlspider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/BBC_News']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.xpath('//title').extract_first()
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['links'] = response.xpath('//a/@href').extract()
        return item
Upvotes: 1
Views: 218
Reputation: 5390
The logic in your `parse_item` method doesn't look quite right:
def parse_item(self, response):
    item = dict()
    item['url'] = response.url
    item['title'] = response.xpath('//title').extract_first()
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        item['links'] = response.xpath('//a/@href').extract()
    return item
You are looping through each link from the extractor, but on every iteration you set `item['links']` to exactly the same thing: all the links on the response page (the `//a/@href` XPath returns every anchor, internal and external, regardless of the loop variable). I expect that you are trying to set `item['links']` to the links found by the `LinkExtractor`. If so, you should change the method to:
def parse_item(self, response):
    item = dict()
    item['url'] = response.url
    item['title'] = response.xpath('//title').extract_first()
    links = [link.url for link in LinkExtractor(deny=self.allowed_domains).extract_links(response)]
    item['links'] = links
    return item
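The filtering idea here can be checked outside Scrapy entirely. Below is a minimal standalone sketch of the same "keep only external links" logic using just the standard library; `external_links` and the sample URLs are hypothetical, not part of the spider above, and unlike `LinkExtractor`'s regex-based `deny`, it compares domains exactly:

```python
from urllib.parse import urlparse

def external_links(hrefs, allowed):
    """Keep only absolute links whose domain is not in `allowed`.

    Relative links (empty netloc) are treated as internal and dropped.
    """
    return [h for h in hrefs if urlparse(h).netloc and urlparse(h).netloc not in allowed]

allowed_domains = ['en.wikipedia.org']
links = [
    'https://en.wikipedia.org/wiki/BBC',   # internal: domain is allowed
    'https://www.bbc.co.uk/news',          # external: different domain
    '/wiki/Broadcasting_House',            # relative: no domain at all
]
print(external_links(links, allowed_domains))  # → ['https://www.bbc.co.uk/news']
```

This mirrors what the extractor's `deny` filter is being used for, and is handy for unit-testing the rule without running a crawl.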
If you really just want the domains then you can use `urlparse` from `urllib.parse` to get the `netloc`. You might also want to remove duplicates with a `set`. So your parse method would become (with the import preferably at the top of your file):
def parse_item(self, response):
    from urllib.parse import urlparse
    item = dict()
    item["url"] = response.url
    item["title"] = response.xpath("//title").extract_first()
    item["links"] = {
        urlparse(link.url).netloc
        for link in LinkExtractor(deny=self.allowed_domains).extract_links(response)
    }
    return item
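To see what the `netloc`-plus-`set` step does on its own, here is a short sketch with made-up example URLs (nothing Scrapy-specific):

```python
from urllib.parse import urlparse

urls = [
    'https://www.bbc.co.uk/news',
    'https://www.bbc.co.uk/sport',     # duplicate domain, collapsed by the set
    'https://twitter.com/BBCNews',
]

# A set comprehension over netloc deduplicates domains automatically.
domains = {urlparse(u).netloc for u in urls}
print(sorted(domains))  # → ['twitter.com', 'www.bbc.co.uk']
```

Note that `netloc` keeps any `www.` prefix and port as-is; strip those separately if you want bare registered domains.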
Upvotes: 2