Rafa

Reputation: 3349

Scrapy website crawler returns invalid path error

I'm new to Scrapy and am following the basic documentation.

I'm trying to scrape some category links from a site so that I can then follow the links within those pages. Specifically, I want the Cokelore, College, and Computers links, and I'm using the code below:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "snopes"
    allowed_domains = ["snopes.com"]
    start_urls = [
        "http://www.snopes.com/info/whatsnew.asp"
    ]

    def parse(self, response):
        print response.xpath('//div[@class="navHeader"]/ul/')
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

This is the error I get:

2015-10-03 23:17:29 [scrapy] INFO: Enabled item pipelines: 
2015-10-03 23:17:29 [scrapy] INFO: Spider opened
2015-10-03 23:17:29 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-03 23:17:29 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-03 23:17:30 [scrapy] DEBUG: Crawled (200) <GET http://www.snopes.com/info/whatsnew.asp> (referer: None)
2015-10-03 23:17:30 [scrapy] ERROR: Spider error processing <GET http://www.snopes.com/info/whatsnew.asp> (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/Gaby/Documents/Code/School/689/tutorial/tutorial/spiders/dmoz_spider.py", line 11, in parse
    print response.xpath('//div[@class="navHeader"]/ul/')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/response/text.py", line 109, in xpath
    return self.selector.xpath(query)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/selector/unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //div[@class="navHeader"]/ul/
2015-10-03 23:17:30 [scrapy] INFO: Closing spider (finished)
2015-10-03 23:17:30 [scrapy] INFO: Dumping Scrapy stats:

I think the error has to do with the /ul in my xpath() call, but I can't figure out why. //div[@class="navHeader"] works fine on its own, and it starts breaking once I add further path steps after it.

The part of the website I'm trying to scrape is structured like so

<DIV CLASS="navHeader">CATEGORIES:</DIV>
    <UL>
        <LI><A HREF="/autos/autos.asp">Autos</A></LI>
        <LI><A HREF="/business/business.asp">Business</A></LI>
        <LI><A HREF="/cokelore/cokelore.asp">Cokelore</A></LI>
        <LI><A HREF="/college/college.asp">College</A></LI>
        <LI><A HREF="/computer/computer.asp">Computers</A></LI>
    </UL>
<DIV CLASS="navSpacer"> &nbsp; </DIV>
    <UL>
        <LI><A HREF="/crime/crime.asp">Crime</A></LI>
        <LI><A HREF="/critters/critters.asp">Critter Country</A></LI>
        <LI><A HREF="/disney/disney.asp">Disney</A></LI>
        <LI><A HREF="/embarrass/embarrass.asp">Embarrassments</A></LI>
        <LI><A HREF="/photos/photos.asp">Fauxtography</A></LI>
    </UL>

Upvotes: 1

Views: 780

Answers (1)

alecxe

Reputation: 474003

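You just need to remove the trailing /. A location path cannot end with / after a step, because the parser expects another step to follow it. Replace: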

//div[@class="navHeader"]/ul/

with:

//div[@class="navHeader"]/ul

Note, however, that the XPath without the trailing slash still matches nothing on the page: the ul element is a sibling of the navHeader div, not a child of it. Use the following-sibling axis instead:

In [1]: response.xpath('//div[@class="navHeader"]/following-sibling::ul//li/a/text()').extract()
Out[1]: 
[u'Autos',
 u'Business',
 u'Cokelore',
 u'College',
 # ...
 u'Weddings']

Upvotes: 1
