Reputation: 10732
I am using ElementTree to parse an XML file. Some fields contain HTML data. For example, consider a declaration as follows:
<Course>
<Description>Line 1<br />Line 2</Description>
</Course>
Now, suppose _course is an Element variable which holds this Course element. I want to access this course's description, so I do:
desc = _course.find("Description").text;
But then desc only contains "Line 1". I read something about the .tail attribute, so I tried also:
desc = _course.find("Description").tail;
And I get the same output. What should I do to make desc be "Line 1<br />Line 2" (or literally anything between <Description> and </Description>)? In other words, I'm looking for something similar to the .innerText property in C# (and many other languages, I guess).
Upvotes: 4
Views: 3002
Reputation: 9430
You are trying to read the tail attribute from the wrong element. Try
desc = _course.find("Description/br").tail;
The tail attribute is used to store trailing text nodes when reading mixed-content XML files; text that follows directly after an element is stored in the tail attribute for that element:
<tag><elem>this goes into elem's text attribute</elem>this goes into elem's tail attribute</tag>
Simple code snippet to print text and tail attributes from all elements in xml/xhtml.
import xml.etree.ElementTree as ET

def processElem(elem):
    # Text before the first child (or all of the text if there are no children)
    if elem.text is not None:
        print(elem.text)
    for child in elem:
        processElem(child)
        # Text that follows the child's closing tag
        if child.tail is not None:
            print(child.tail)

xml = '''<Course>
<Description>Line 1<br />Line 2 <span>child text </span>child tail</Description>
</Course>'''

root = ET.fromstring(xml)
processElem(root)
Output:
Line 1 Line 2 child text child tail
See http://code.activestate.com/recipes/498286-elementtree-text-helper/ for a better solution. It can be modified to suit.
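If only the text is needed and the markup can be dropped, a shorter route may be Element.itertext(), which walks the text and tail values in document order. This is a minimal sketch, not the linked recipe; the helper name inner_text is hypothetical, and note that <br /> contributes nothing to the result, so the two lines run together.

```python
import xml.etree.ElementTree as ET


def inner_text(elem):
    """Join every text and tail found inside elem, in document order."""
    return "".join(elem.itertext())


course = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)
desc = inner_text(course.find("Description"))
print(repr(desc))
```

If a separator is wanted where the <br /> was, the text/tail walk above is probably the better starting point, since it sees each element as it goes.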
P.S. I changed my name from user839338 as quoted in the next post
Upvotes: 3
Reputation: 156238
Inspired by user839338's answer, I went and looked for a reasonable solution, which looks a bit like this.
>>> from xml.etree import ElementTree as etree
>>> corpus = '''<Course>
... <Description>Line 1<br />Line 2</Description>
... </Course>'''
>>>
>>> doc = etree.fromstring(corpus)
>>> desc = doc.find("Description")
>>> desc.tag = 'html'
>>> etree.tostring(desc)
'<html>Line 1<br/>Line 2</html>\n'
>>>
There's no simple way to eliminate the surrounding tag (originally <Description>), but it's easily modified into something that could be used as needed, for instance <div> or <span>.
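One possible workaround for the surrounding tag, sketched here under the assumption of Python 3's xml.etree.ElementTree: take the element's own text and append each serialized child, relying on the fact that tostring() writes a child's tail along with it. The helper name inner_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def inner_xml(elem):
    """Return the markup between elem's opening and closing tags."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child and its tail text together
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


doc = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)
desc = inner_xml(doc.find("Description"))
print(desc)
```

The result is re-serialized markup rather than the original bytes, so details such as attribute quoting or whitespace inside tags may differ from the source file.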
Upvotes: 1
Reputation: 15198
Do you have any control over the creation of the XML file? The contents of XML tags which contain XML tags (or similar), or markup chars ('<', etc.) should be encoded to avoid this problem. You can do this with, for example, entity encoding ('<' == '&lt;').
If you can't make these changes, and ElementTree can't ignore tags not included in the XML schema, then you will have to pre-process the file. Of course, you're out of luck if the schema overlaps HTML.
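To illustrate the encoding approach, here is a small sketch that assumes you control how the file is written. It uses xml.sax.saxutils.escape from the standard library; the variable names and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html_fragment = "Line 1<br />Line 2"

# Writer side: turn markup characters into entities before embedding them
escaped = escape(html_fragment)
document = "<Course><Description>" + escaped + "</Description></Course>"

# Reader side: the parser decodes the entities, so .text holds the full string
desc = ET.fromstring(document).find("Description").text
print(escaped)
print(desc)
```

With the content encoded this way, Description has no child elements, so the plain .text lookup from the question should be enough.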
Upvotes: 3
Reputation: 3130
Characters like "<" and "&" are illegal in XML elements.
"<" will generate an error because the parser interprets it as the start of a new element.
"&" will generate an error because the parser interprets it as the start of a character entity.
Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors, script code can be defined as CDATA.
Everything inside a CDATA section is treated as literal character data; the parser does not interpret it as markup.
A CDATA section starts with "<![CDATA[" and ends with "]]>".
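As a rough illustration applied to the question's document (the sample string below is an assumption, not the asker's real file), wrapping the description in a CDATA section lets ElementTree return the whole fragment through .text:

```python
import xml.etree.ElementTree as ET

document = (
    "<Course>"
    "<Description><![CDATA[Line 1<br />Line 2]]></Description>"
    "</Course>"
)

course = ET.fromstring(document)
# The CDATA content arrives as ordinary text, with "<" and ">" left intact
desc = course.find("Description").text
print(desc)
```

The content itself must not contain the sequence "]]>", since that would end the section early.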
More information on: http://www.w3schools.com/xmL/xml_cdata.asp
Hope this helps!
Upvotes: 1