Reputation: 12864

LXML: remove the x

I am creating a sitemap parser with LXML and want to extract the tags with their values. The resulting tags, however, always contain the xmlns information, e.g. {http://www.sitemaps.org/schemas/sitemap/0.9}loc.

import cStringIO
from lxml import etree

# item['body'] holds the raw sitemap XML shown below
body = cStringIO.StringIO(item['body'])
parser = etree.XMLParser(recover=True, load_dtd=True, ns_clean=True)
tree = etree.parse(body, parser)

for sitemap in tree.xpath('./*'):
    print sitemap.xpath('./*')[0].tag
    # prints: {http://www.sitemaps.org/schemas/sitemap/0.9}loc

The sitemap string:

<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>

I want to extract only the tag name, here 'loc', without the {http://www.sitemaps.org/schemas/sitemap/0.9} prefix. Is there a way to configure LXML or its parser to do that?

Note: I know that I can use a simple regex replacement - a friend told me to ask for help if an implementation feels more complicated than it should be.

Upvotes: 2

Answers (4)

alecxe

Reputation: 473813

Not sure this is the best approach, but it uses lxml as you've asked and it works:

import cStringIO
from lxml import etree


text = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
    <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""

body = cStringIO.StringIO(text)
parser = etree.XMLParser(recover=True, load_dtd=True, ns_clean=True)
tree = etree.parse(body, parser)

for item in tree.xpath("./*"):
    # match the local name exactly instead of any tag containing 'loc'
    if item.tag.endswith('}loc'):
        print item.text

prints

http://www.some_page.com/sitemap-page-2010-11.xml

Hope that helps.
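
If the goal is the bare tag name rather than the text, one option is to cut the '{namespace}' prefix off each tag. What follows is a minimal Python 3 sketch, not a built-in parser setting: strip_ns is a hypothetical helper name, and the stdlib fallback import is an assumption added so the snippet also runs without lxml installed. With lxml itself, etree.QName(element).localname should give the same result.

```python
# Python 3 sketch. strip_ns is a hypothetical helper, not part of lxml.
try:
    from lxml import etree  # used when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same '{ns}name' tag format

SITEMAP = b"""<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""


def strip_ns(tag):
    """Return the local part of a '{namespace}local' tag name."""
    # rpartition leaves the tag unchanged when it has no '}' in it
    return tag.rpartition('}')[2]


root = etree.fromstring(SITEMAP)
# Assumes the children are plain elements (no comments or processing instructions)
local_names = [strip_ns(child.tag) for child in root]
print(local_names)  # ['loc', 'lastmod']
```

This leaves the parsed tree untouched and only changes how the tag is displayed, so namespace-aware lookups keep working.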

Upvotes: 1

tau

Reputation: 11

I'm not sure whether you meant removing the tag and leaving its text, so here is another answer.

from ehp import *

data = '''
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>'''

html = Html()
dom  = html.feed(data)

for root, ind in dom.find_with_root('loc'):
    root.remove(ind)
    root.append(Data(ind.text()))


# Printing the DOM then gives the output shown below.
print dom



""" <sitemap xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >

  <lastmod >2011-12-22T15:46:17+00:00</lastmod>
http://www.some_page.com/sitemap-page-2010-11.xml</sitemap>
"""

Upvotes: 0

tau

Reputation: 11

I would try this with the following tool:

htmlparser.sourceforge.net/

A friend told me it was simple, and indeed it is. I find it much better than BeautifulSoup or anything similar.

from ehp import *

data = '''
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>'''

html = Html()
dom  = html.feed(data)
seq  = [ind.text() for ind in dom.find('loc')]

print seq

# Output:
# ['http://www.some_page.com/sitemap-page-2010-11.xml']

Upvotes: 0

Steve Barnes

Reputation: 28370

In a perfect world you would use an XML parsing or HTML scraping library to parse your XML and make sure you have the exact tags that you need, in context. In this case, though, it is almost certainly simpler, quicker and good enough to use a regular expression to match what you need.

>>> import re
>>> samp = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
...     <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
...     <lastmod>2011-12-22T15:46:17+00:00</lastmod>
... </sitemap>"""
>>> re.findall(r'<loc>(.*?)</loc>', samp)  # non-greedy, so several <loc> elements on one line stay separate
['http://www.some_page.com/sitemap-page-2010-11.xml']
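
For comparison, the parser-based route mentioned at the top of this answer could look roughly like the Python 3 sketch below. It maps the sitemap namespace to a prefix and looks elements up by local name, so the '{...}' form never has to be typed or stripped. The 'sm' prefix and the NS dictionary name are arbitrary choices made for this example, and the stdlib fallback import is an assumption so the snippet runs without lxml installed.

```python
# Python 3 sketch of a namespace-aware lookup; 'sm' is an arbitrary prefix.
try:
    from lxml import etree  # used when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same find() signature

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

SAMPLE = b"""<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""

root = etree.fromstring(SAMPLE)

# find() resolves the 'sm:' prefix through the NS mapping
loc = root.find('sm:loc', namespaces=NS)
lastmod = root.find('sm:lastmod', namespaces=NS)

values = {'loc': loc.text, 'lastmod': lastmod.text}
print(values)
```

Unlike the regular expression, this keeps matching when the elements carry attributes or a namespace prefix, at the cost of a few more lines.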

Upvotes: 2
