Dani Valverde

Reputation: 409

How to get rid of exceptions.TypeError error?

I am writing a scraper using Scrapy. One of the things I want it to do is to compare the root domain of the current webpage with the root domain of the links within it. If these domains differ, it should proceed to extract data. This is my current code:

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']
    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            #Extract the root domain for the main website from the canonical URL
            hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
            hostname1 = urlparse(hostname1).hostname
            #Extract the root domain for thelink
            hostname2 = link.xpath('@href').extract()
            hostname2 = urlparse(hostname2).hostname
            #Compare if the root domain of the website and the root domain of the link are different.
            #If so, extract the items & build the dictionary 
            if hostname1 != hostname2:
                item = SocialMediaItem()
                item['SourceTitle'] = link.xpath('/html/head/title').extract()
                item['TargetTitle'] = link.xpath('text()').extract()
                item['link'] = link.xpath('@href').extract()
                items.append(item)
        return items

However, when I run it I get this error:

Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "E:\Usuarios\Daniel\GitHub\SocialMedia-Web-Scraper\socialmedia\socialmedia\spiders\SocialMedia.py", line 16, in parse
    hostname1 = urlparse(hostname1).hostname
  File "C:\Anaconda\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Anaconda\lib\urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
exceptions.TypeError: unhashable type: 'list'

Can anyone help me to get rid of this error? I gather it has something to do with lists being used as dictionary keys, but I don't know how to solve it. Thank you so much!

Dani

Upvotes: 1

Views: 836

Answers (2)

bosnjak

Reputation: 8624

There are a few things wrong here:

  1. There is no need to calculate hostname1 inside the loop: the XPath expression is absolute rather than relative, so even when applied to a link sub-selector it always selects the same rel="canonical" element. Absolute is what you need here, but it means the value can be computed once, before the loop.

  2. The XPath expression for hostname1 is malformed: the two adjacent single quotes around canonical are Python string concatenation, not an escaped quote, so the selector matches nothing and extract() returns an empty list. That is why taking only the first element, as Kevin proposed, still fails. Use a double-quoted string, or escape the single quotes.

  3. You are selecting the link element itself, when you should be getting its @href attribute. The XPath expression should end in /@href to reflect this.

After resolving these issues, the code could look something like this (not tested):

    def parse(self, response):
        items = []
        hostname1 = response.xpath("/html/head/link[@rel='canonical']/@href").extract()[0]
        hostname1 = urlparse(hostname1).hostname

        for link in response.xpath("//a"):
            hostname2 = (link.xpath('@href').extract() or [''])[0]
            hostname2 = urlparse(hostname2).hostname
            #Compare and extract
            if hostname1 != hostname2:
                ...
        return items
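
The core comparison can also be factored into a small helper that is easy to test in isolation (a sketch; `is_external` is a name introduced here, not part of the original spider):

```python
from urllib.parse import urlparse   # Python 3; Python 2 uses `from urlparse import urlparse`

def is_external(page_url, href):
    """Return True when `href` points to a different host than `page_url`.

    Relative links (which have no hostname of their own) are treated
    as internal and excluded.
    """
    page_host = urlparse(page_url).hostname
    link_host = urlparse(href).hostname
    return link_host is not None and link_host != page_host

# Inside parse(), the check then becomes:
#     href = (link.xpath('@href').extract() or [''])[0]
#     if is_external(canonical_url, href):
#         ...
```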

Upvotes: 2

Kevin

Reputation: 76254

    hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
    hostname1 = urlparse(hostname1).hostname

extract returns a list of strings, but urlparse accepts only a single string. Perhaps you should discard all but the first hostname found.

    hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()[0]
    hostname1 = urlparse(hostname1).hostname

And likewise for the other hostname.

    hostname2 = link.xpath('@href').extract()[0]
    hostname2 = urlparse(hostname2).hostname

If you're not certain whether the document even has a hostname, it may be useful to look before you leap.

    hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
    if not hostname1: continue
    hostname1 = urlparse(hostname1[0]).hostname

    hostname2 = link.xpath('@href').extract()
    if not hostname2: continue
    hostname2 = urlparse(hostname2[0]).hostname

Upvotes: 1
