Reputation: 165
I have started using Scrapy recently and I'm trying to use the XMLFeedSpider to extract and load the pages that are listed in an XML feed, but it returns an error: "IndexError: list index out of range".
I'm trying to collect and load all product pages listed at this address:
"http://www.example.com/feed.xml"
My spider:
from scrapy.spiders import XMLFeedSpider


class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['http://www.example.com']
    start_urls = [
        'http://www.example.com/feed.xml'
    ]
    itertag = 'loc'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
Upvotes: 2
Views: 1033
Reputation: 20748
This is how your XML input starts:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.htm</loc></url>
<url><loc>http://www.example.htm</loc></url>
(...)
And there's actually a bug in XMLFeedSpider when using the default iternodes iterator on an XML document that declares a namespace. See this archived discussion in the scrapy-users mailing list.
This spider works after changing the iterator to xml, which lets you register the namespace (here http://www.sitemaps.org/schemas/sitemap/0.9) under a prefix, n (it could be anything really), and use that prefix in the tag to look for, here n:loc:
from scrapy.spiders import XMLFeedSpider


class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/example.xml'
    ]
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
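If the goal is to actually crawl the product pages listed in the sitemap, parse_node can yield a request for each extracted URL instead of only logging it. A minimal sketch, assuming a hypothetical parse_product callback and placeholder selectors for the real product markup:

import scrapy
from scrapy.spiders import XMLFeedSpider


class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/example.xml']

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    iterator = 'xml'
    itertag = 'n:loc'

    def parse_node(self, response, node):
        # each <loc> node holds one product-page URL; follow it
        url = node.xpath('text()').extract_first()
        if url:
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        # hypothetical callback: adjust the selectors to the real product markup
        yield {
            'url': response.url,
            'title': response.css('title::text').extract_first(),
        }

Anything yielded from parse_node (items or requests) is handled by Scrapy as usual, so the product pages get scheduled and parsed like any other crawl.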
Upvotes: 4