Reputation: 3349
I'm new to Scrapy and am following the basic documentation.
I'm trying to scrape some links from a site so that I can then follow the links inside those pages. Specifically, I want the Cokelore, College, and Computers links, and I'm using the code below:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "snopes"
    allowed_domains = ["snopes.com"]
    start_urls = [
        "http://www.snopes.com/info/whatsnew.asp"
    ]

    def parse(self, response):
        print response.xpath('//div[@class="navHeader"]/ul/')
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
This is the error I get:
2015-10-03 23:17:29 [scrapy] INFO: Enabled item pipelines:
2015-10-03 23:17:29 [scrapy] INFO: Spider opened
2015-10-03 23:17:29 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-03 23:17:29 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-03 23:17:30 [scrapy] DEBUG: Crawled (200) <GET http://www.snopes.com/info/whatsnew.asp> (referer: None)
2015-10-03 23:17:30 [scrapy] ERROR: Spider error processing <GET http://www.snopes.com/info/whatsnew.asp> (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/Gaby/Documents/Code/School/689/tutorial/tutorial/spiders/dmoz_spider.py", line 11, in parse
    print response.xpath('//div[@class="navHeader"]/ul/')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/response/text.py", line 109, in xpath
    return self.selector.xpath(query)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/selector/unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //div[@class="navHeader"]/ul/
2015-10-03 23:17:30 [scrapy] INFO: Closing spider (finished)
2015-10-03 23:17:30 [scrapy] INFO: Dumping Scrapy stats:
I think the error has to do with the /ul in my xpath() call, but I can't figure out why. //div[@class="navHeader"] works fine on its own, and it starts breaking once I start adding anything after that.
The part of the website I'm trying to scrape is structured like so
<DIV CLASS="navHeader">CATEGORIES:</DIV>
<UL>
<LI><A HREF="/autos/autos.asp">Autos</A></LI>
<LI><A HREF="/business/business.asp">Business</A></LI>
<LI><A HREF="/cokelore/cokelore.asp">Cokelore</A></LI>
<LI><A HREF="/college/college.asp">College</A></LI>
<LI><A HREF="/computer/computer.asp">Computers</A></LI>
</UL>
<DIV CLASS="navSpacer"> </DIV>
<UL>
<LI><A HREF="/crime/crime.asp">Crime</A></LI>
<LI><A HREF="/critters/critters.asp">Critter Country</A></LI>
<LI><A HREF="/disney/disney.asp">Disney</A></LI>
<LI><A HREF="/embarrass/embarrass.asp">Embarrassments</A></LI>
<LI><A HREF="/photos/photos.asp">Fauxtography</A></LI>
</UL>
Upvotes: 1
Views: 780
Reputation: 474003
You just need to remove the trailing / (an XPath location path cannot end with a step separator). Replace:
//div[@class="navHeader"]/ul/
with:
//div[@class="navHeader"]/ul
Note that even the corrected XPath would match nothing on this page: the ul element is a sibling of the navHeader div, not a child of it. Use the following-sibling axis instead:
In [1]: response.xpath('//div[@class="navHeader"]/following-sibling::ul//li/a/text()').extract()
Out[1]:
[u'Autos',
u'Business',
u'Cokelore',
u'College',
# ...
u'Weddings']
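To try both points outside a crawl, here is a minimal sketch that runs the three expressions against a trimmed copy of the markup from the question. It uses lxml directly (the library Scrapy's selectors are built on) rather than a Scrapy response, so the invalid expression raises an lxml XPath error here instead of the ValueError shown in the traceback. The PAGE, BASE_URL, and WANTED names are illustrative assumptions, and the markup is lowercased and shortened by hand.

```python
from urllib.parse import urljoin

from lxml import etree, html

# Trimmed, hand-lowercased copy of the markup from the question (an assumption:
# the real page has more categories and surrounding layout).
PAGE = """
<html><body>
<div class="navHeader">CATEGORIES:</div>
<ul>
<li><a href="/autos/autos.asp">Autos</a></li>
<li><a href="/cokelore/cokelore.asp">Cokelore</a></li>
<li><a href="/college/college.asp">College</a></li>
<li><a href="/computer/computer.asp">Computers</a></li>
</ul>
<div class="navSpacer"> </div>
<ul>
<li><a href="/crime/crime.asp">Crime</a></li>
</ul>
</body></html>
"""

BASE_URL = "http://www.snopes.com/info/whatsnew.asp"  # start URL from the question
WANTED = {"Cokelore", "College", "Computers"}         # categories the asker is after

tree = html.fromstring(PAGE)

# 1. A trailing "/" leaves the last location step unfinished, so the
#    expression is rejected before anything is matched.
try:
    tree.xpath('//div[@class="navHeader"]/ul/')
    trailing_slash_ok = True
except etree.XPathError:
    trailing_slash_ok = False

# 2. Without the trailing "/" the expression is valid, but it selects nothing:
#    the ul is not a child of the navHeader div.
child_matches = tree.xpath('//div[@class="navHeader"]/ul')

# 3. following-sibling reaches every ul that comes after the header.
anchors = tree.xpath('//div[@class="navHeader"]/following-sibling::ul//li/a')

# Keep only the wanted categories and resolve their hrefs against the page URL.
wanted_urls = {
    a.text: urljoin(BASE_URL, a.get("href"))
    for a in anchors
    if a.text in WANTED
}

print(trailing_slash_ok)
print(child_matches)
print(wanted_urls)
```

Inside a spider the same following-sibling expression can be passed to response.xpath(), and the resolved URLs are what you would then request to follow each category.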
Upvotes: 1