Is it possible to treat text as xml element with lxml?

Question

I'd like to filter an element tree to remove duplicate element entries. In short, I'm trying to clean up XML output into something that can be parsed by a different tool.

For example


  

    Text node 1
    

      asdasd
      
    
      Text node 2 Som text
    
    Text node 3
  
  Text node 4

Would be converted to this:


  Text node 1
  

  asdasd
  

Text node 2 Som text
Text node 3
Text node 4

In lxml, getchildren only seems to return XML elements. So when I call getchildren on the p containing the ul, it returns a list like [ul, p], though I'd want a list containing:

[Text, Ul, P, Text], so I can easily walk down or up the tree to reduce the superfluous elements.

Chris Doyle · Accepted Answer

The lxml documentation says there is no separate text node type: text either belongs to an element and is accessed through .text, or it follows an element's closing tag and is accessed through that element's .tail.

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.

The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).
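As a minimal sketch of the .text/.tail split described above, the snippet below parses a small mixed-content string and prints where each piece of text ends up. The markup and the variable names are illustrative assumptions; the fallback import is only there because the standard library's ElementTree exposes the same two properties.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same .text/.tail API
    import xml.etree.ElementTree as etree

# Illustrative mixed-content markup: text on both sides of an empty element.
root = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = root[0]
br = body[0]

print(repr(body.text))  # text inside body, before its first child
print(repr(br.text))    # the empty br element has no text of its own
print(repr(br.tail))    # text after br, up to the next element or closing tag
```

The text before the first child belongs to the parent's .text, while text after a child belongs to that child's .tail, even though it sits inside the parent in the serialized document.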

I can't say the code below is pretty or exactly what you want, but it might at least point you in the right direction.

from lxml import etree

tree = etree.parse("test.dat").getroot()
main_p = tree[0]
elements = [main_p.text]
for child in main_p:
    elements.append(child.tag)
    elements.append(child.tail)
    print(f"TAG: {child.tag} has tail: #{child.tail}#")

print(elements)

OUTPUT

TAG: ul has tail: #
    #
TAG: p has tail: #
    Text node 3
  #
['
    Text node 1
    ', 'ul', '
    ', 'p', '
    Text node 3
  ']

So "Text node 1" is the text of the main p, but "Text node 3", although it sits inside the main p, is actually the tail of the inner p.
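Building on that, one possible way to get the [Text, Ul, P, Text] style list the question asks for is to interleave .text, the children, and each child's .tail. This is only a sketch: mixed_children is a hypothetical helper (not part of lxml), the sample markup and its tag names are assumed for illustration, and whitespace-only text is dropped as a design choice.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same .text/.tail API
    import xml.etree.ElementTree as etree


def mixed_children(element):
    """Hypothetical helper: return the element's direct content in document
    order, with non-blank text as str and child elements as Element objects."""
    items = []
    if element.text and element.text.strip():
        items.append(element.text.strip())
    for child in element:
        items.append(child)
        # The tail is the text that follows the child inside its parent.
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


# Assumed sample markup; the tag names are illustrative only.
SAMPLE = """<root><p>
    Text node 1
    <ul><li>asdasd</li></ul>
    <p>Text node 2</p>
    Text node 3
  </p></root>"""

main_p = etree.fromstring(SAMPLE)[0]
summary = [item if isinstance(item, str) else item.tag
           for item in mixed_children(main_p)]
print(summary)  # ['Text node 1', 'ul', 'p', 'Text node 3']
```

The returned strings are plain copies, so changing the tree still means assigning to .text or .tail on the relevant element.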

In addition, you can iterate over the main p element and, if a child element is a p tag, move it out of the main p and add it to the root element. Again, the code below is just an example.

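from lxml import etree

tree = etree.parse("test.dat").getroot()
main_p = tree[0]
# Iterate over a reversed copy so that repeated inserts right after main_p
# keep the moved elements in their original order.
for child in main_p[::-1]:
    if child.tag == 'p':
        # Move the nested p out of main_p; lxml moves its tail along with it.
        tree.insert(tree.index(main_p) + 1, child)
        # Put the former tail text into a p of its own, after the moved p.
        new_p = etree.Element('p')
        new_p.text = child.tail
        tree.insert(tree.index(child) + 1, new_p)
        child.tail = "\n"

tree.tag = 'something_else'
print(etree.tostring(tree, pretty_print=True).decode('utf-8'))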

OUTPUT


   
      Text node 1
      

         asdasd
      
   
   
      Text node 2
      Som text
   
   Text node 3
   Text node 4
