Reputation: 165
I have started using Scrapy recently and I'm trying to use the XMLFeedSpider to extract and load the pages that are listed in an XML feed, but it returns an error: "IndexError: list index out of range".
I'm trying to collect and load all product pages listed at this address:
"http://www.example.com/feed.xml"
My spider:
from scrapy.spiders import XMLFeedSpider


class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['http://www.example.com']
    start_urls = [
        'http://www.example.com/feed.xml'
    ]
    itertag = 'loc'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
Upvotes: 2
Views: 1033
Reputation: 20748
This is how your XML input starts:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.htm</loc></url>
<url><loc>http://www.example.htm</loc></url>
(...)
And there's actually a bug in XMLFeedSpider when using the default iternodes iterator on an XML document that declares a namespace. See this archived discussion in the scrapy-users mailing list.
This spider works after changing the iterator to xml, which lets you register the namespace (here http://www.sitemaps.org/schemas/sitemap/0.9) under a prefix, n (it could be anything really), and use that prefix in the tag to look for, here n:loc:
from scrapy.spiders import XMLFeedSpider


class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/example.xml'
    ]
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
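If the goal is to actually crawl the product pages listed in the sitemap, parse_node can yield a request for each extracted URL instead of only logging it. A minimal sketch, assuming a hypothetical parse_product callback and placeholder selectors for the real product markup:

import scrapy
from scrapy.spiders import XMLFeedSpider


class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/example.xml']

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    iterator = 'xml'
    itertag = 'n:loc'

    def parse_node(self, response, node):
        # each <loc> node holds one product-page URL; follow it
        url = node.xpath('text()').extract_first()
        if url:
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        # hypothetical callback: adjust the selectors to the real product markup
        yield {
            'url': response.url,
            'title': response.css('title::text').extract_first(),
        }

Anything yielded from parse_node (items or requests) is handled by Scrapy as usual, so the product pages get scheduled and parsed like any other crawl.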
Upvotes: 4