Reputation: 131
I have a file that has HTML within XML tags and I want that HTML as raw text, rather than have it be parsed as children of the XML tag. Here's an example:
import xml.etree.ElementTree as ET
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")
If i try:
root.find('text').text
It returns no output
but root.find('text/p').text will return the paragraph text without the tags. I want everything within the text tag as raw text, but I can't figure out how to get this.
Upvotes: 0
Views: 2869
Reputation: 345
Above solutions will miss initial part of your html if your content begins with text. E.g.
<root><text>This is <i>some text</i> that I want to read</text></root>
You can do that:
node = root.find('text')
output_list = [node.text] if node.text else []
output_list += [ET.tostring(child, encoding="unicode") for child in node]
output_text = ''.join(output_list)
Upvotes: 0
Reputation: 20792
Your solution is reasonable. An element object is the list of children. The .text
attribute of the element object is related only to things (usually a text) that are not part of other (nested) elements.
There are things to be improved in your code. In Python, string concatenation is an expensive operation. It is better to build the list of substrings and to join them later -- like this:
output_lst = []
for child in root.find('text'):
output_lst.append(ET.tostring(child, encoding="unicode"))
output_text = ''.join(output_lst)
The list can be also build using the Python list comprehension construct, so the code would change to:
output_lst = [ET.tostring(child, encoding="unicode") for child in root.find('text')]
output_text = ''.join(output_lst)
The .join
can consume any iterable that produces strings. This way the list need not to be constructed in advance. Instead, a generator expression (that is what can be seen inside the []
of the list comprehension) can be used:
output_text = ''.join(ET.tostring(child, encoding="unicode") for child in root.find('text'))
The one-liner can be formatted to more lines to make it more readable:
output_text = ''.join(ET.tostring(child, encoding="unicode")
for child in root.find('text'))
Upvotes: 2
Reputation: 131
I was able to get what I wanted by appending all child elements of my text tag to a string using ET.tostring:
output_text = ""
for child in root.find('text'):
output_text += ET.tostring(child, encoding="unicode")
>>>output_text
>>>"<p>This is some text that I want to read</p>"
Upvotes: 1