Jono Ellis

Reputation: 37

How do I get Python Scrapy to extract the domains of all external links from a web page?

I want the loop to check each link and output it only if it points to an external domain; at the moment it outputs all links (internal and external). What have I messed up? (For testing I've tweaked the code to work from a single page rather than crawl the rest of the site.)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re

class MySpider(CrawlSpider):
    name = 'crawlspider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/BBC_News']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title']=response.xpath('//title').extract_first()
        for link in LinkExtractor(allow=(),deny=self.allowed_domains).extract_links(response):
            item['links']=response.xpath('//a/@href').extract()
        return item

Upvotes: 1

Views: 218

Answers (1)

tomjn

Reputation: 5390

The logic in your parse_item method doesn't look quite right:

def parse_item(self, response):
    item = dict()
    item['url'] = response.url
    item['title']=response.xpath('//title').extract_first()
    for link in LinkExtractor(allow=(),deny=self.allowed_domains).extract_links(response):
        item['links']=response.xpath('//a/@href').extract()
    return item

You are looping through each link from the extractor, but on every iteration you overwrite item["links"] with exactly the same thing: all the links from the response page, internal and external alike. I expect you are trying to set item["links"] to the links found by the LinkExtractor? If so, change the method to:

def parse_item(self, response):
    item = dict()
    item['url'] = response.url
    item['title'] = response.xpath('//title').extract_first()
    links = [
        link.url
        for link in LinkExtractor(deny=self.allowed_domains).extract_links(response)
    ]
    item['links'] = links
    return item
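
A quick way to sanity-check the output is to dump the scraped items to JSON; assuming the spider is saved in a file called crawlspider.py (the file name here is just an example):

scrapy runspider crawlspider.py -o items.json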

If you really just want the domains then you can use urlparse from urllib.parse to get the netloc (the host part of the URL). You might also want to remove duplicates with a set.
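
For example (the BBC URL below is only an illustration):

from urllib.parse import urlparse

urlparse("https://www.bbc.co.uk/news").netloc  # -> 'www.bbc.co.uk'

So your parse method would become (with the import preferably at the top of your file):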

def parse_item(self, response):
    from urllib.parse import urlparse
    item = dict()
    item["url"] = response.url
    item["title"] = response.xpath("//title").extract_first()
    item["links"] = {
        urlparse(link.url).netloc
        for link in LinkExtractor(deny=self.allowed_domains).extract_links(response)
    }
    return item
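
As an aside, LinkExtractor's deny argument is interpreted as a regular expression (or list of them) matched against the full URL, while deny_domains takes plain domain names. Passing allowed_domains to deny happens to work here because the dots in en.wikipedia.org also match as regex wildcards, but deny_domains is arguably the more precise fit:

# deny_domains takes domain names (not regexes), so the spider's
# allowed_domains list can be passed through unchanged
LinkExtractor(deny_domains=self.allowed_domains).extract_links(response)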

Upvotes: 2
