miracle2k
miracle2k

Reputation: 32047

Python: Extract HTML from an XML file

My XML file looks like this:

 <strings>
      <string>Bla <b>One &amp; Two</b> Foo</string>
 </strings>

I want to extract the content of each <string> while maintaining the inner tags. That is, I would like to see the following Python string: u"Bla <b>One & Two</b> Foo". Alternatively, I guess I could settle on u"Bla <b>One & Two</b> Foo", and then try to replace the entities myself.

I am currently using lxml, which allows me to iterate over the nested tags, missing out on the text not inside a tag, or alternatively over all text content (itertext), losing the tag information. I'm probably missing something.

If possible I'd prefer to keep lxml, though I can switch to another library if necessary.

Upvotes: 0

Views: 1445

Answers (4)

Robert Rossney
Robert Rossney

Reputation: 96730

There may be a better way of conditionally handling objects returned by the xpath() function, but I'm not sufficiently conversant with lxml to know what it is, so I had to write a function to return the text value of a node. But that said, this shows a general approach to the problem:

>>> from lxml import etree
>>> from StringIO import StringIO
>>> def node_text(n):
        try:
            return etree.tostring(n, method='html', with_tail=False)
        except TypeError:
            return str(n)

>>> f = StringIO('<strings><string>This is <b>not</b> how I plan to escape.</string></strings>')
>>> x = etree.parse(f)
>>> ''.join(node_text(n) for n in x.xpath('/strings/string/node()'))
'This is <b>not</b> how I plan to escape.'

Upvotes: 3

cobbal
cobbal

Reputation: 70743

try etree.tostring

outer = etree.tostring(string_elem, method='html')
inner = re.match("^[^>]+>(.*)<[^<]+$", outer).groups(1)[0]

Upvotes: 2

ghostdog74
ghostdog74

Reputation: 342423

Not using parser, but just pure string manipulation

mystring="""
 <strings>
      <string>Bla <b>One &amp; Two</b> Foo</string>
 </strings>
"""
for s in mystring.split("</string>"):
    if "<string>" in s:
        i = s.index("<string>")
        print s[i+len("<string>"):].replace("&amp;","")

Upvotes: -1

wizzard0
wizzard0

Reputation: 1928

Regardless of the language, relatively simple XSLT template would do the trick.

Something like defining patterns to tags you want to keep, converting to text others.

You can of course use a recursive function with a compliant DOM implementation (minidom maybe?) and process tags by hand.

(pseudocode)

def Function(tag):
   if tag.NodeType = "#text": return tag.innerText
   text=""
   if tag.ElementName in allowedTags:
       text="<%s>"%tag.ElementName
   text += [Function(subtag) for subtag in tag.childs]
   if tag.ElementName in allowedTags:
       text+="</%s>"%tag.ElementName
   return text

Upvotes: 0

Related Questions