Scrapy website crawler returns invalid path error

Question

I'm new to Scrapy and am following the basic documentation.

I'm trying to scrape some links from a site so that I can then follow them and scrape links from the pages they point to. Specifically, I want the Cokelore, College, and Computers links, and I'm using the code below:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "snopes"
    allowed_domains = ["snopes.com"]
    start_urls = [
        "http://www.snopes.com/info/whatsnew.asp"
    ]

    def parse(self, response):
        print response.xpath('//div[@class="navHeader"]/ul/')
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

This is my error

2015-10-03 23:17:29 [scrapy] INFO: Enabled item pipelines: 
2015-10-03 23:17:29 [scrapy] INFO: Spider opened
2015-10-03 23:17:29 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-03 23:17:29 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-03 23:17:30 [scrapy] DEBUG: Crawled (200)  (referer: None)
2015-10-03 23:17:30 [scrapy] ERROR: Spider error processing  (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/Gaby/Documents/Code/School/689/tutorial/tutorial/spiders/dmoz_spider.py", line 11, in parse
    print response.xpath('//div[@class="navHeader"]/ul/')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/response/text.py", line 109, in xpath
    return self.selector.xpath(query)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/selector/unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //div[@class="navHeader"]/ul/
2015-10-03 23:17:30 [scrapy] INFO: Closing spider (finished)
2015-10-03 23:17:30 [scrapy] INFO: Dumping Scrapy stats:

I think the error has to do with the /ul in my xpath() call, but I can't figure out why. //div[@class="navHeader"] works fine on its own, and it starts breaking once I add anything after that.

The part of the website I'm trying to scrape is structured like so

CATEGORIES:
    
        Autos
        Business
        Cokelore
        College
        Computers
    
   
    
        Crime
        Critter Country
        Disney
        Embarrassments
        Fauxtography

alecxe · Accepted Answer

You just need to remove the trailing /. Replace:

//div[@class="navHeader"]/ul/

with:

//div[@class="navHeader"]/ul

Note that this XPath would actually match nothing on the page: /ul selects only ul elements that are children of the div, but here the ul element is a sibling of the navigation header. Use the following-sibling axis instead:

In [1]: response.xpath('//div[@class="navHeader"]/following-sibling::ul//li/a/text()').extract()
Out[1]: 
[u'Autos',
 u'Business',
 u'Cokelore',
 u'College',
 # ...
 u'Weddings']
