Reputation: 12864
I am writing a sitemap parser with lxml and want to extract the tags and their values.
The resulting tags, however, always contain the xmlns information, e.g. {http://www.sitemaps.org/schemas/sitemap/0.9}loc.
body = cStringIO.StringIO(item['body'])
parser = etree.XMLParser(recover=True, load_dtd=True, ns_clean=True)
tree = etree.parse(body, parser)
for sitemap in tree.xpath('./*'):
    print sitemap.xpath('./*')[0].tag
    # prints: {http://www.sitemaps.org/schemas/sitemap/0.9}loc
The sitemap string:
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>
I want to extract only the tag name, here 'loc', without the {http://www.sitemaps.org/schemas/sitemap/0.9} prefix. Is there a way to configure the parser or lxml itself to do that?
Note: I know that I can use a simple regex replacement - a friend told me to ask for help if an implementation feels more complicated than it should be.
Upvotes: 2
Views: 182
Reputation: 473813
Not sure this is the best approach, but it uses lxml as you've asked and it works:
import cStringIO
from lxml import etree
text = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
<lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""
body = cStringIO.StringIO(text)
parser = etree.XMLParser(recover=True, load_dtd=True, ns_clean=True)
tree = etree.parse(body, parser)
for item in tree.xpath("./*"):
    if 'loc' in item.tag:
        print item.text
prints
http://www.some_page.com/sitemap-page-2010-11.xml
Hope that helps.
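If you want the bare tag names themselves rather than a substring test, here is a minimal sketch (Python 3, not from the original answer). It uses the standard library's xml.etree.ElementTree only to stay self-contained; lxml elements carry the same `{namespace}name` form in `.tag`, so the same string handling should apply there. `local_name` is a hypothetical helper name introduced for illustration. lxml also offers `etree.QName(element).localname`, which I believe gives the same result without manual string handling.

```python
import xml.etree.ElementTree as ET

SITEMAP = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""


def local_name(tag):
    """Drop a leading '{namespace}' part from a tag, if there is one.

    Hypothetical helper: it keeps everything after the last '}', so a tag
    without a namespace is returned unchanged.
    """
    return tag.rpartition("}")[2]


root = ET.fromstring(SITEMAP)

# Map each child's namespace-free tag name to its text content.
fields = {local_name(child.tag): child.text for child in root}
print(fields)
```

Comparing `local_name(item.tag) == 'loc'` is stricter than `'loc' in item.tag`, which would also match any tag or namespace URI that merely contains the letters "loc".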
Upvotes: 1
Reputation: 11
I'm not sure whether you meant removing the tag and leaving only its text. If so, here is another answer.
from ehp import *
data = '''
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
<lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>'''
html = Html()
dom = html.feed(data)
for root, ind in dom.find_with_root('loc'):
    root.remove(ind)
    root.append(Data(ind.text()))
# This gives:
print dom
""" <sitemap xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >
<lastmod >2011-12-22T15:46:17+00:00</lastmod>
http://www.some_page.com/sitemap-page-2010-11.xml</sitemap>
"""
Upvotes: 0
Reputation: 11
I would try this with the tool at htmlparser.sourceforge.net/
A friend told me it was simple, and indeed it is. I find it much better than BeautifulSoup or anything similar.
from ehp import *
data = '''
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
<lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>'''
html = Html()
dom = html.feed(data)
seq = [ind.text() for ind in dom.find('loc')]
print seq
# It gives me.
# ['http://www.some_page.com/sitemap-page-2010-11.xml']
Upvotes: 0
Reputation: 28370
In a perfect world you would use an XML parsing or HTML scraping library to parse the document, so that you match exactly the tags you need, in context. In this case, though, it is almost certainly simpler, quicker and good enough to use a regular expression to match what you need.
>>> import re
>>> samp = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
... <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
... <lastmod>2011-12-22T15:46:17+00:00</lastmod>
... </sitemap>"""
>>> re.findall(r'<loc>(.*?)</loc>', samp)  # non-greedy, so two <loc> elements on one line stay separate
['http://www.some_page.com/sitemap-page-2010-11.xml']
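For comparison, the parser-based route is not much longer. This is only a sketch (Python 3, standard-library xml.etree.ElementTree rather than lxml, to keep it self-contained); the `sm` prefix is an arbitrary name chosen here, not something the sitemap defines. As far as I know, lxml's `findall` accepts a prefix mapping in the same way.

```python
import xml.etree.ElementTree as ET

# Arbitrary local prefix mapped to the sitemap namespace URI.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

samp = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
<lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""

root = ET.fromstring(samp)

# Namespace-aware lookup: matches only direct <loc> children in the sitemap namespace.
locs = [el.text for el in root.findall("sm:loc", NS)]
print(locs)
```

Unlike the regex, this keeps working if a `<loc>` tag gains attributes or a namespace prefix.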
Upvotes: 2