Loïc Faure-Lacroix

Reputation: 13600

Is it possible to treat text as xml element with lxml?

I'd like to filter an element tree to remove redundant nested elements. In short, I'm trying to clean up XML output so that it can be parsed by a different tool.

For example, this:

<p>
  <p>
    Text node 1
    <ul>
      <li>asdasd</li>
    </ul>  
    <p>
      Text node 2 <span>Som text</span>
    </p>
    Text node 3
  </p>
  <p>Text node 4</p>
</p>

Would be converted to this:

<p>
  Text node 1
  <ul>
  <li>asdasd</li>
  </ul>
</p>
<p>Text node 2 <span>Som text</span></p>
<p>Text node 3</p>
<p>Text node 4</p>

In lxml, getchildren() only seems to return XML elements. So when I call getchildren() on the p containing the ul, it returns a list like [ul, p], though I'd like a list containing:

[Text, Ul, P, Text], so I can easily walk up or down the tree to remove the superfluous elements.

Upvotes: 2

Views: 373

Answers (1)

Chris Doyle

Reputation: 12027

The lxml documentation says it has no text nodes. Text is either held by the element itself and accessed through .text, or it trails an element's closing tag and is accessed through that element's .tail. Quoting the documentation:

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.

The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).
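As a quick check of the quoted passage, here is a minimal sketch that parses the same snippet and reads both properties. The import fallback is my own addition so the snippet also runs where lxml is not installed; the standard library module exposes the same .text and .tail attributes.

```python
try:
    from lxml import etree
except ImportError:  # fallback assumption: stdlib ElementTree has the same .text/.tail API
    import xml.etree.ElementTree as etree

root = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = root[0]
br = body[0]

body_text = body.text  # text before the first child of <body>
br_text = br.text      # <br/> has no content of its own
br_tail = br.tail      # text that follows <br/> inside <body>

print(repr(body_text), repr(br_text), repr(br_tail))
```

So "World" lives on the br element, not on body, even though it is inside body in the serialized document.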

I can't say the code below is pretty or exactly what you want, but it might at least point you in the right direction.

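from lxml import etree

# test.dat is assumed to hold the sample XML from the question
tree = etree.parse("test.dat").getroot()
main_p = tree[0]  # the first nested <p>
# Start with the text that precedes the first child element
elements = [main_p.text]
for child in main_p:
    elements.append(child.tag)
    # .tail is the text between this child's closing tag and the next element
    elements.append(child.tail)
    print(f"TAG: {child.tag} has tail: #{child.tail}#")

print(elements)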

OUTPUT

TAG: ul has tail: #
    #
TAG: p has tail: #
    Text node 3
  #
['\n    Text node 1\n    ', 'ul', '\n    ', 'p', '\n    Text node 3\n  ']

So "Text node 1" is the .text of the main p, but "Text node 3", while it sits inside the main p, is actually the .tail of the inner p.
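Building on that, the [Text, Ul, P, Text] style list from the question can be rebuilt from .text and .tail. The sketch below is one way to do it; `iter_mixed` is a hypothetical helper name, not part of lxml, and the sample string is a whitespace-free version of the inner p from the question. The import fallback is only there so the snippet runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback assumption: stdlib ElementTree has the same .text/.tail API
    import xml.etree.ElementTree as etree


def iter_mixed(element):
    """Yield the element's text chunks and child elements in document order.

    Hypothetical helper: empty or missing text chunks are skipped.
    """
    if element.text:
        yield element.text
    for child in element:
        yield child
        if child.tail:
            yield child.tail


sample = (
    "<p>Text node 1<ul><li>asdasd</li></ul>"
    "<p>Text node 2 <span>Som text</span></p>Text node 3</p>"
)
main_p = etree.fromstring(sample)

# Show strings as-is and elements by tag name
items = [n if isinstance(n, str) else n.tag for n in iter_mixed(main_p)]
print(items)
```

With lxml specifically, `main_p.xpath("node()")` should give a comparable list of text and element nodes, though it also returns whitespace-only text and comments.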

In addition, you can iterate over the main p element and, if a child element is a p tag, move it out of the main p and insert it into the root element right after the main p. Its tail text then gets wrapped in a new p of its own. Again, the code below is just an example.

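from lxml import etree

# test.dat is assumed to hold the sample XML from the question
tree = etree.parse("test.dat").getroot()
main_p = tree[0]
# Iterate over a reversed copy so moving children does not disturb the loop
for child in main_p[::-1]:
    if child.tag == 'p':
        # Inserting an element that already has a parent moves it,
        # and its tail text travels with it
        tree.insert(tree.index(main_p) + 1, child)
        # Wrap the former tail text in a <p> of its own, right after the moved child
        new_p = etree.Element('p')
        new_p.text = child.tail
        tree.insert(tree.index(child) + 1, new_p)
        child.tail = "\n"

# Rename the root so the result is no longer a <p> wrapping other <p> elements
tree.tag = 'something_else'
print(etree.tostring(tree, pretty_print=True).decode('utf-8'))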

OUTPUT

<something_else>
   <p>
      Text node 1
      <ul>
         <li>asdasd</li>
      </ul>
   </p>
   <p>
      Text node 2
      <span>Som text</span>
   </p>
   <p>Text node 3</p>
   <p>Text node 4</p>
</something_else>
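The same idea can be packaged as a small function. This is only a sketch under a few assumptions: `hoist_nested` is a hypothetical name, the sample is the question's XML with the outer element renamed to `<root>` and whitespace removed, and a tail that contains only whitespace is dropped instead of being wrapped in an empty p. The child is removed explicitly before being re-inserted so the code behaves the same with the standard library fallback.

```python
try:
    from lxml import etree
except ImportError:  # fallback assumption: stdlib ElementTree has the same .text/.tail API
    import xml.etree.ElementTree as etree


def hoist_nested(root, outer, tag="p"):
    """Move children of `outer` named `tag` into `root`, right after `outer`."""
    # Reversed copy: safe to modify `outer` while looping, keeps document order
    for child in reversed(list(outer)):
        if child.tag != tag:
            continue
        tail, child.tail = child.tail, None  # detach the trailing text first
        outer.remove(child)
        pos = list(root).index(outer) + 1
        root.insert(pos, child)
        if tail and tail.strip():
            # Give the former tail text a wrapper element of its own
            wrapper = etree.Element(tag)
            wrapper.text = tail.strip()
            root.insert(pos + 1, wrapper)


root = etree.fromstring(
    "<root><p>Text node 1<ul><li>asdasd</li></ul>"
    "<p>Text node 2 <span>Som text</span></p>Text node 3</p>"
    "<p>Text node 4</p></root>"
)
hoist_nested(root, root[0])

texts = [(p.text or "").strip() for p in root]
print(texts)
```

This only handles one level of nesting; deeper structures would need the function applied repeatedly or recursively.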

Upvotes: 2
