Rafael Almeida

Reputation: 10732

HTML inside node using ElementTree

I am using ElementTree to parse an XML file. Some fields contain HTML data. For example, consider a declaration as follows:

<Course>
    <Description>Line 1<br />Line 2</Description>
</Course>

Now, suppose _course is an Element variable that holds this Course element. I want to access this course's description, so I do:

desc = _course.find("Description").text;

But then desc only contains "Line 1". I read something about the .tail attribute, so I tried also:

desc = _course.find("Description").tail;

And I get the same output. What should I do to make desc be "Line 1
Line 2" (or literally anything between <Description> and </Description>)? In other words, I'm looking for something similar to the .innerText property in C# (and many other languages, I guess).

Upvotes: 4

Views: 3002

Answers (4)

Dan-Dev

Reputation: 9430

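You are trying to read the tail attribute from the wrong element. The trailing text belongs to the <br /> element, not to <Description>. Because find() with a bare tag name only searches direct children, use a path to reach the nested element. Try

desc = _course.find("Description/br").tail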

The tail attribute is used to store trailing text nodes when reading mixed-content XML files; text that follows directly after an element is stored in the tail attribute for that element:

    <tag><elem>this goes into elem's
    text attribute</elem>this goes into
    elem's tail attribute</tag>
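
Applied to the markup from the question, the split between text and tail can be checked directly. This is a minimal sketch assuming Python 3 and a compact version of the question's XML (no whitespace between tags); the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Compact version of the question's document (assumption: no whitespace between tags).
course = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)

desc = course.find("Description")
br = desc.find("br")

print(repr(desc.text))  # text before the first child of <Description>
print(repr(br.text))    # <br /> is empty, so it has no text of its own
print(repr(br.tail))    # text that follows <br /> inside <Description>
print(repr(desc.tail))  # nothing follows </Description> in this compact form
```

With the indented XML from the question, desc.tail would hold the whitespace between </Description> and </Course> rather than the second line.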

Simple code snippet to print text and tail attributes from all elements in xml/xhtml.

import xml.etree.ElementTree as ET

def processElem(elem):
    # Text that appears before the element's first child
    if elem.text is not None:
        print(elem.text)
    for child in elem:
        processElem(child)
        # Text that follows the child's closing tag
        if child.tail is not None:
            print(child.tail)

xml = '''<Course>
    <Description>Line 1<br />Line 2 <span>child text </span>child tail</Description>
    </Course>'''

root = ET.fromstring(xml)
processElem(root)

Output:

Line 1
Line 2 
child text 
child tail

See http://code.activestate.com/recipes/498286-elementtree-text-helper/ for a better solution. It can be modified to suit.
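
For something closer to .innerText, a rough sketch along the same lines is shown here. It assumes Python 3; inner_text is a hypothetical helper (not part of ElementTree), and treating <br /> as a newline is a choice made for this example. The standard Element.itertext() method is shown for comparison; it concatenates the same text but knows nothing about <br />.

```python
import xml.etree.ElementTree as ET

def inner_text(elem, br_tag="br"):
    """Hypothetical helper: join all text under elem, mapping <br/> to a newline."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == br_tag:
            parts.append("\n")
        else:
            parts.append(inner_text(child, br_tag))
        # The tail belongs to the child but sits inside elem, so include it.
        parts.append(child.tail or "")
    return "".join(parts)

xml = """<Course>
    <Description>Line 1<br />Line 2 <span>child text </span>child tail</Description>
</Course>"""

desc = ET.fromstring(xml).find("Description")

with_breaks = inner_text(desc)     # <br /> becomes "\n"
flat = "".join(desc.itertext())    # standard library, no special handling of <br />

print(with_breaks)
print(flat)
```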

P.S. I changed my name from user839338 as quoted in the next post

Upvotes: 3

SingleNegationElimination

Reputation: 156238

Inspired by user839338's answer, I went and looked for a reasonable solution, which looks a bit like this.

>>> from xml.etree import ElementTree as etree
>>> corpus = '''<Course>
...     <Description>Line 1<br />Line 2</Description>
... </Course>'''
>>> 
>>> doc = etree.fromstring(corpus)
>>> desc = doc.find("Description")
>>> desc.tag = 'html'
>>> etree.tostring(desc)
'<html>Line 1<br/>Line 2</html>\n'
>>> 

There's no simple way to eliminate the surrounding tag (originally <Description>), but it's easily modified into something that could be used as needed, for instance <div> or <span>.
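
If the wrapper really has to go, one possible workaround is to serialize the pieces individually: the element's own text followed by each child. This is only a sketch, assuming Python 3 (where encoding="unicode" makes tostring return a str); inner_xml is a hypothetical helper name. It relies on tostring including each child's tail in its output.

```python
import xml.etree.ElementTree as ET

def inner_xml(elem):
    """Hypothetical helper: markup between elem's start and end tags."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child together with its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

doc = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)
markup = inner_xml(doc.find("Description"))
print(markup)
```

Note that the leading text is returned unescaped here, so content containing "&" or "<" as entities would need re-escaping before being embedded in HTML.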

Upvotes: 1

Dana the Sane

Reputation: 15198

Do you have any control over the creation of the xml file? The contents of xml tags which contain xml tags (or similar), or markup chars ('<', etc) should be encoded to avoid this problem. You can do this with either:

  • a CDATA section
  • Base64 or some other encoding (which doesn't include xml reserved characters)
  • Entity encoding ('<' == '&lt;')

If you can't make these changes, and ElementTree can't ignore tags not included in the xml schema, then you will have to pre-process the file. Of course, you're out of luck if the schema overlaps html.
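
As an illustration of the entity-encoding option on the producer side, here is a small sketch assuming Python 3 and that the file is written with ElementTree itself. Assigning the HTML as plain text lets the serializer encode the markup characters, and the reader then gets the whole string back from .text.

```python
import xml.etree.ElementTree as ET

html = "Line 1<br />Line 2"

# Build the document and store the HTML as plain text, not as child elements.
course = ET.Element("Course")
desc = ET.SubElement(course, "Description")
desc.text = html

# On serialization, '<' and '>' in text are written as &lt; and &gt;.
serialized = ET.tostring(course, encoding="unicode")
print(serialized)

# Reading it back yields the original HTML string in a single .text value.
round_trip = ET.fromstring(serialized).find("Description").text
print(round_trip)
```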

Upvotes: 3

ylebre

Reputation: 3130

Characters like "<" and "&" are illegal in XML elements.

"<" will generate an error because the parser interprets it as the start of a new element.

"&" will generate an error because the parser interprets it as the start of a character entity.

Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors, script code can be defined as CDATA.

Everything inside a CDATA section is treated by the parser as literal character data, not as markup.

A CDATA section starts with "<![CDATA[" and ends with "]]>":
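
A short sketch of the difference, assuming Python 3 and ElementTree as the parser; the script content is made up for illustration. The raw version fails to parse, while the CDATA version comes back as ordinary text.

```python
import xml.etree.ElementTree as ET

# Raw "<" and "&" inside element content: not well-formed XML.
raw = "<script>if (a < b && b < c) run();</script>"
try:
    ET.fromstring(raw)
    failed = False
except ET.ParseError as exc:
    failed = True
    print("parse error:", exc)

# Same content wrapped in a CDATA section: parsed as literal text.
wrapped = "<script><![CDATA[if (a < b && b < c) run();]]></script>"
text = ET.fromstring(wrapped).text
print(text)
```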

More information on: http://www.w3schools.com/xmL/xml_cdata.asp

Hope this helps!

Upvotes: 1
