Reputation: 409
I am writing a scraper using Scrapy. One of the things I want it to do is to compare the root domain of the current webpage with the root domain of the links within it. If these domains are different, then it has to proceed to extract the data. This is my current code:
class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            # Extract the root domain for the main website from the canonical URL
            hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
            hostname1 = urlparse(hostname1).hostname
            # Extract the root domain for the link
            hostname2 = link.xpath('@href').extract()
            hostname2 = urlparse(hostname2).hostname
            # Compare if the root domain of the website and the root domain of the link are different.
            # If so, extract the items & build the dictionary
            if hostname1 != hostname2:
                item = SocialMediaItem()
                item['SourceTitle'] = link.xpath('/html/head/title').extract()
                item['TargetTitle'] = link.xpath('text()').extract()
                item['link'] = link.xpath('@href').extract()
                items.append(item)
        return items
However, when I run it I get this error:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "E:\Usuarios\Daniel\GitHub\SocialMedia-Web-Scraper\socialmedia\socialmedia\spiders\SocialMedia.py", line 16, in parse
    hostname1 = urlparse(hostname1).hostname
  File "C:\Anaconda\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Anaconda\lib\urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
exceptions.TypeError: unhashable type: 'list'
Can anyone help me get rid of this error? I gather it has something to do with using a list as a key, but I don't know how to solve it. Thank you so much!
Dani
Upvotes: 1
Views: 836
Reputation: 8624
There are a few things wrong here:
There is no need to calculate hostname1 inside the loop: the expression always selects the same canonical link element, because even though it is used on a sub-selector, the XPath expression is absolute rather than relative (which is how you need it to be here).
The XPath expression for hostname1 is malformed, so it matches nothing, hence the error when trying to take only the first element as Kevin proposed. You have two consecutive single quotes in the expression where you need one escaped single quote or a double quote; see the short demonstration after this list.
You are selecting the link element itself when you should be selecting its @href attribute, so the XPath expression needs to be extended accordingly.
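As a quick illustration of the quoting problem: Python concatenates adjacent string literals, so the doubled single quotes simply close and reopen the string and the quotes around canonical disappear from the XPath. A hypothetical interpreter session:
>>> # Adjacent string literals are concatenated; the intended quotes around 'canonical' vanish
>>> '/html/head/link[@rel=''canonical'']'
'/html/head/link[@rel=canonical]'
>>> # A double-quoted Python string keeps the XPath as intended
>>> "/html/head/link[@rel='canonical']"
"/html/head/link[@rel='canonical']"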
After resolving these issues, the code could look something like this (not tested):
def parse(self, response):
    items = []
    hostname1 = response.xpath("/html/head/link[@rel='canonical']/@href").extract()[0]
    hostname1 = urlparse(hostname1).hostname
    for link in response.xpath("//a"):
        hostname2 = (link.xpath('@href').extract() or [''])[0]
        hostname2 = urlparse(hostname2).hostname
        # Compare and extract
        if hostname1 != hostname2:
            ...
    return items
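One caveat worth mentioning: many @href values on a page like this are relative (e.g. /wiki/Facebook), and urlparse on a relative URL gives a hostname of None, so every internal relative link would compare as "different" from the canonical host. If that is not what you want, you can resolve each href against the page URL first with urljoin from the same urlparse module. A rough sketch (not tested, and comparing against response.url rather than the canonical link just to keep it short):
from urlparse import urljoin, urlparse

def parse(self, response):
    items = []
    # Host of the page currently being parsed
    hostname1 = urlparse(response.url).hostname
    for link in response.xpath("//a"):
        href = (link.xpath('@href').extract() or [''])[0]
        # Resolve relative hrefs such as '/wiki/Facebook' before comparing hosts
        hostname2 = urlparse(urljoin(response.url, href)).hostname
        if hostname2 and hostname1 != hostname2:
            ...  # external link: build the item as above
    return items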
Upvotes: 2
Reputation: 76254
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
hostname1 = urlparse(hostname1).hostname
extract returns a list of strings, but urlparse accepts only a single string. Perhaps you should discard all but the first hostname found.
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()[0]
hostname1 = urlparse(hostname1).hostname
And likewise for the other hostname.
hostname2 = link.xpath('@href').extract()[0]
hostname2 = urlparse(hostname2).hostname
If you're not certain whether the document even has a hostname, it may be useful to look before you leap.
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
if not hostname1: continue
hostname1 = urlparse(hostname1[0]).hostname
hostname2 = link.xpath('@href').extract()
if not hostname2: continue
hostname2 = urlparse(hostname2[0]).hostname
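For the record, the reason the traceback complains about an unhashable type rather than a wrong argument type is that urlparse caches its results in a dictionary keyed by the URL (that is the _parse_cache.get(key, None) line in your traceback), and a list cannot be used as a dictionary key. A quick interpreter check reproduces roughly the same error:
>>> from urlparse import urlparse
>>> urlparse(['http://en.wikipedia.org/wiki/Social_media'])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'
>>> urlparse('http://en.wikipedia.org/wiki/Social_media').hostname
'en.wikipedia.org'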
Upvotes: 1