Reputation: 919
How would you scrape a sitemap URL with a LinkExtractor?
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
LinkExtractor targets the href attribute of an <a> tag:
<a href="http://mylink.com">MyLink</a>
How would you use LxmlLinkExtractor to target <url>/<loc> elements instead?
Upvotes: 2
Views: 1615
Reputation: 21271
Try XMLFeedSpider
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item
Or use a regex to extract all URLs:
import re
re.findall(r"<loc>(.*?)</loc>", your_string, re.DOTALL)
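To make the regex route concrete, here is a minimal sketch that runs the same pattern over the sample sitemap from the question. The names sitemap_xml and extract_locs are made up for illustration, and the whitespace stripping is an extra assumption to cope with <loc> values that span several lines.

```python
import re

# The sample sitemap from the question, used here as stand-in input.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>"""


def extract_locs(xml_text):
    """Return the text of every <loc> element (hypothetical helper)."""
    # re.DOTALL lets '.' match newlines, so a <loc> split across lines still matches.
    matches = re.findall(r"<loc>(.*?)</loc>", xml_text, re.DOTALL)
    # Strip surrounding whitespace that pretty-printed sitemaps may contain.
    return [m.strip() for m in matches]


urls = extract_locs(sitemap_xml)
print(urls)
```

A regex like this ignores namespaces and XML escaping (for example &amp; inside a URL), so it is best suited to quick one-off extraction rather than strict parsing.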
Upvotes: 3
Reputation: 917
In this case you could use bs4.
from bs4 import BeautifulSoup
XML = '''<?xml version="1.0" encoding..... '''
# Use a separate name for the parsed document so the import is not shadowed.
soup = BeautifulSoup(XML, 'xml')  # the 'xml' parser requires lxml to be installed
urlset_tag = soup.find_all('urlset')
##out: list with one element --> [<urlset xmlns="http://www.si....]
link = urlset_tag[0].find_all('loc')
##out: [<loc>http://www.example.com/</loc>]
link_str = str(link[0].text)
##out:'http://www.example.com/'
If you have more than one urlset tag, loop over them, because the list will contain more than one element:
links = []
for link in urlset_tag:
    links.append(str(link.find_all('loc')[0].text))
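Note that the loop above keeps only the first <loc> under each urlset. If the goal is every <loc> in the sitemap, a namespace-aware parse is one option. The sketch below swaps bs4 for the standard library's xml.etree.ElementTree so it runs without extra installs; the helper name loc_urls, the "sm" prefix, and the two-URL sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the sitemap in the question; the "sm" prefix is arbitrary.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def loc_urls(xml_text):
    """Return the text of every <loc> in a sitemap string (hypothetical helper)."""
    # Strip stray whitespace around the document and pass bytes so the
    # encoding declaration in the XML prolog is honoured.
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sample with two <url> entries.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>"""

print(loc_urls(sample))
```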
Upvotes: 1