Scrapy - Scrape sitemap with LinkExtractor

Question

How would you scrape a sitemap URL with a LinkExtractor?

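<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>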

By default, LinkExtractor targets the href attribute of an <a> tag.

<a href="...">MyLink</a>

How would you use LxmlLinkExtractor to target <url>/<loc> elements instead?

Umair Ayub · Accepted Answer

Try XMLFeedSpider

from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item

Or use a regex to extract all URLs from the <loc> elements:

import re

re.findall(r"<loc>(.*?)</loc>", your_string, re.DOTALL)
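
To make the regex option concrete, here is a minimal sketch using only the standard library. It assumes the pattern is meant to capture the text between <loc> and </loc>, as in the sitemap from the question. The sample sitemap string, the second URL in it, and the helper name extract_sitemap_urls are made up for illustration.

```python
import re

# Hypothetical sample data: a small sitemap in the standard sitemaps.org layout.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
   </url>
   <url>
      <loc>
         http://www.example.com/about
      </loc>
   </url>
</urlset>
"""

# Non-greedy capture between <loc> and </loc>; DOTALL lets "." match newlines,
# so a <loc> whose text is wrapped onto its own line is still found.
LOC_PATTERN = re.compile(r"<loc>(.*?)</loc>", re.DOTALL)


def extract_sitemap_urls(xml_text):
    """Return the whitespace-stripped text of every <loc> element in xml_text."""
    return [match.strip() for match in LOC_PATTERN.findall(xml_text)]


urls = extract_sitemap_urls(SAMPLE_SITEMAP)
print(urls)
```

Inside a spider callback, the same helper could be given response.text and each returned URL turned into a new request. A regex like this does not understand XML, so it will miss entries written with a namespace prefix or wrapped in CDATA; an XML-aware approach such as XMLFeedSpider is more robust for those.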
