Reputation: 409
I am writing a scraper using Scrapy. One of the things I want it to do is to compare the root domain of the current webpage with the root domain of the links within it. If these domains are different, then it has to proceed to extract the data. This is my current code:
class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            # Extract the root domain for the main website from the canonical URL
            hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
            hostname1 = urlparse(hostname1).hostname
            # Extract the root domain for the link
            hostname2 = link.xpath('@href').extract()
            hostname2 = urlparse(hostname2).hostname
            # Compare if the root domain of the website and the root domain of the link are different.
            # If so, extract the items & build the dictionary
            if hostname1 != hostname2:
                item = SocialMediaItem()
                item['SourceTitle'] = link.xpath('/html/head/title').extract()
                item['TargetTitle'] = link.xpath('text()').extract()
                item['link'] = link.xpath('@href').extract()
                items.append(item)
        return items
However, when I run it I get this error:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "E:\Usuarios\Daniel\GitHub\SocialMedia-Web-Scraper\socialmedia\socialmedia\spiders\SocialMedia.py", line 16, in parse
    hostname1 = urlparse(hostname1).hostname
  File "C:\Anaconda\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Anaconda\lib\urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
exceptions.TypeError: unhashable type: 'list'
Can anyone help me get rid of this error? I gather it has something to do with using a list as a key, but I don't know how to solve it. Thank you so much!
Dani
Upvotes: 1
Views: 836
Reputation: 8624
There are a few things wrong here:
There is no need to calculate hostname1 inside the loop: the expression always selects the same canonical link element, because even though it is used on a sub-selector, the XPath expression is absolute rather than relative (which is how you need it to be here).
The XPath expression for hostname1 is malformed, so it matches nothing, hence the error when trying to take only the first element as Kevin proposed. You have two consecutive single quotes in the expression where you need one escaped single quote or a double quote; see the short demonstration after this list.
You are selecting the link element itself when you should be selecting its @href attribute, so the XPath expression needs to be extended accordingly.
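As a quick illustration of the quoting problem: Python concatenates adjacent string literals, so the doubled single quotes simply close and reopen the string and the quotes around canonical disappear from the XPath. A hypothetical interpreter session:
>>> # Adjacent string literals are concatenated; the intended quotes around 'canonical' vanish
>>> '/html/head/link[@rel=''canonical'']'
'/html/head/link[@rel=canonical]'
>>> # A double-quoted Python string keeps the XPath as intended
>>> "/html/head/link[@rel='canonical']"
"/html/head/link[@rel='canonical']"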
After resolving these issues, the code could look something like this (not tested):
def parse(self, response):
    items = []
    hostname1 = response.xpath("/html/head/link[@rel='canonical']/@href").extract()[0]
    hostname1 = urlparse(hostname1).hostname
    for link in response.xpath("//a"):
        hostname2 = (link.xpath('@href').extract() or [''])[0]
        hostname2 = urlparse(hostname2).hostname
        # Compare and extract
        if hostname1 != hostname2:
            ...
    return items
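One caveat worth mentioning: many @href values on a page like this are relative (e.g. /wiki/Facebook), and urlparse on a relative URL gives a hostname of None, so every internal relative link would compare as "different" from the canonical host. If that is not what you want, you can resolve each href against the page URL first with urljoin from the same urlparse module. A rough sketch (not tested, and comparing against response.url rather than the canonical link just to keep it short):
from urlparse import urljoin, urlparse

def parse(self, response):
    items = []
    # Host of the page currently being parsed
    hostname1 = urlparse(response.url).hostname
    for link in response.xpath("//a"):
        href = (link.xpath('@href').extract() or [''])[0]
        # Resolve relative hrefs such as '/wiki/Facebook' before comparing hosts
        hostname2 = urlparse(urljoin(response.url, href)).hostname
        if hostname2 and hostname1 != hostname2:
            ...  # external link: build the item as above
    return items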
Upvotes: 2
Reputation: 76254
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
hostname1 = urlparse(hostname1).hostname
extract returns a list of strings, but urlparse accepts only a single string. Perhaps you should discard all but the first hostname found.
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()[0]
hostname1 = urlparse(hostname1).hostname
And likewise for the other hostname.
hostname2 = link.xpath('@href').extract()[0]
hostname2 = urlparse(hostname2).hostname
If you're not certain whether the document even has a hostname, it may be useful to look before you leap.
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
if not hostname1: continue
hostname1 = urlparse(hostname1[0]).hostname
hostname2 = link.xpath('@href').extract()
if not hostname2: continue
hostname2 = urlparse(hostname2[0]).hostname
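For the record, the reason the traceback complains about an unhashable type rather than a wrong argument type is that urlparse caches its results in a dictionary keyed by the URL (that is the _parse_cache.get(key, None) line in your traceback), and a list cannot be used as a dictionary key. A quick interpreter check reproduces roughly the same error:
>>> from urlparse import urlparse
>>> urlparse(['http://en.wikipedia.org/wiki/Social_media'])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'
>>> urlparse('http://en.wikipedia.org/wiki/Social_media').hostname
'en.wikipedia.org'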
Upvotes: 1