Reputation: 131

How to parse HTML tags as raw text using ElementTree

I have a file that has HTML within XML tags and I want that HTML as raw text, rather than have it be parsed as children of the XML tag. Here's an example:

import xml.etree.ElementTree as ET
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")

If I try:

root.find('text').text

It returns None (so the interpreter prints nothing), but root.find('text/p').text returns the paragraph text without the tags. I want everything within the text tag as raw text, but I can't figure out how to get it.

Upvotes: 0

Answers (3)

soysal

Reputation: 345

The other solutions will miss the initial part of your HTML if the content begins with text, e.g.:

<root><text>This is <i>some text</i> that I want to read</text></root>

You can handle that case like this:

node = root.find('text')
output_list = [node.text] if node.text else []
output_list += [ET.tostring(child, encoding="unicode") for child in node]
output_text = ''.join(output_list)
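
For convenience, the same idea can be wrapped in a small helper. This is only a sketch: the name inner_xml is made up for illustration and is not part of ElementTree. It relies on ET.tostring also serializing each child's trailing text (its .tail), which is why the text after </i> is not lost.

```python
import xml.etree.ElementTree as ET

def inner_xml(element):
    """Hypothetical helper: leading text plus serialized children as one string."""
    # element.text is the text before the first child (None if there is none).
    # Note: it is the decoded text, so it is not re-escaped here.
    parts = [element.text or ""]
    # ET.tostring(child) includes the child's tail text as well.
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)

root = ET.fromstring("<root><text>This is <i>some text</i> that I want to read</text></root>")
print(inner_xml(root.find('text')))
# This is <i>some text</i> that I want to read
```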

Upvotes: 0

pepr

Reputation: 20792

Your solution is reasonable. An element object behaves like a list of its children. The .text attribute of an element holds only the text between its start tag and its first child element; it does not include anything that belongs to nested elements.
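
To make this concrete, here is a small sketch with a mixed-content string (the variable names are only for illustration). Text before the first child is stored in .text of the parent, while text that follows a child is stored in that child's .tail:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><text>This is <i>some text</i> that I want to read</text></root>")
node = root.find('text')
child = node[0]          # the <i> element

print(repr(node.text))   # 'This is '
print(repr(child.text))  # 'some text'
print(repr(child.tail))  # ' that I want to read'
```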

There are things that can be improved in your code. In Python, repeatedly concatenating strings in a loop can be expensive, because each concatenation may copy the whole string built so far. It is better to build a list of substrings and join them at the end, like this:

output_lst = []  
for child in root.find('text'):
    output_lst.append(ET.tostring(child, encoding="unicode"))

output_text = ''.join(output_lst)

The list can also be built with a list comprehension, so the code becomes:

output_lst = [ET.tostring(child, encoding="unicode") for child in root.find('text')]  
output_text = ''.join(output_lst)

The .join method can consume any iterable that produces strings, so the list does not need to be constructed in advance. Instead, a generator expression (the part inside the [] of the list comprehension) can be passed directly:

output_text = ''.join(ET.tostring(child, encoding="unicode") for child in root.find('text'))

The one-liner can be split across several lines to make it more readable:

output_text = ''.join(ET.tostring(child, encoding="unicode")
                      for child in root.find('text'))

Upvotes: 2

seitzej

Reputation: 131

I was able to get what I wanted by appending all child elements of my text tag to a string using ET.tostring:

output_text = ""    
for child in root.find('text'):
    output_text += ET.tostring(child, encoding="unicode")

>>> output_text
'<p>This is some text that I want to read</p>'

Upvotes: 1
