Reputation: 2449
I'm trying to write a simple spider that grabs some XML and outputs it in a new format, as an experiment. However, the output contains extra markup: each value is wrapped in a value tag and still carries its original element tag. The format I want has no extra markup or value tag, along the lines of: <body>Don't forget me this weekend!</body>
I think I'm using XPath incorrectly, but I can't tell what I'm doing wrong.
Spider
from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['www.w3schools.com']
    start_urls = ['http://www.w3schools.com/xml/note.xml']
    itertag = 'note'

    def parse_node(self, response):
        xxs = XmlXPathSelector(response)
        to = xxs.select('//to')
        who = xxs.select('//from')
        heading = xxs.select('//heading')
        body = xxs.select('//body')
        return item
Input
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
Current (wrong) Output
<?xml version="1.0" encoding="UTF-8"?>
<items>
    <item>
        <body>
            <value><body>Don't forget me this weekend!</body></value>
        </body>
        <to>
            <value><to>Tove</to></value>
        </to>
        <who>
            <value><from>Jani</from></value>
        </who>
        <heading>
            <value><heading>Reminder</heading></value>
        </heading>
    </item>
</items>
Upvotes: 1
Views: 708
Reputation: 473753
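The signature of parse_node() is incorrect. It should take a selector argument, and you should call the xpath() method on that selector. Appending /text() to each XPath selects the element's text rather than the whole element. For example: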
def parse_node(self, response, selector):
    to = selector.xpath('//to/text()').extract()
    who = selector.xpath('//from/text()').extract()
    print to, who
Prints:
[u'Tove'] [u'Jani']
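To see the difference between selecting a whole element and selecting only its text without running a crawl, here is a minimal sketch that uses Python 3's standard-library xml.etree.ElementTree instead of Scrapy. The helper name extract_fields and the field-to-tag mapping are assumptions made for illustration; they are not part of Scrapy or of the spider above.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, written without whitespace
# between elements so that serialized elements carry no trailing text.
NOTE_XML = (
    "<note>"
    "<to>Tove</to>"
    "<from>Jani</from>"
    "<heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body>"
    "</note>"
)


def extract_fields(xml_text):
    """Return a dict of plain-text values from a <note> document.

    Hypothetical helper for illustration only.
    """
    note = ET.fromstring(xml_text)
    # Map item field names to child tag names. 'who' holds <from>,
    # mirroring the field names seen in the question's output.
    field_to_tag = {
        "to": "to",
        "who": "from",
        "heading": "heading",
        "body": "body",
    }
    # findtext() returns the text of the first matching child as a
    # single string, not a list and not a serialized element.
    return {field: note.findtext(tag) for field, tag in field_to_tag.items()}


note = ET.fromstring(NOTE_XML)

# Serializing the element keeps its own tag, as in the unwanted output.
whole_element = ET.tostring(note.find("body"), encoding="unicode")

# Taking only the text drops the tag.
fields = extract_fields(NOTE_XML)

print(whole_element)
print(fields["body"])
```

The same idea applies in the spider: extract() returns a list (hence [u'Tove'] above), so taking a single element of that list gives one plain string per field.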
Upvotes: 1