lxml removes unwrapped text inside tag

Question

Here is my Python code using lxml:

from lxml import etree

some_xml_data = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
root = etree.fromstring(some_xml_data)
[c] = root.xpath('//span')
print(etree.tostring(root))  # b'<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>' (output as expected)

# But if I make some changes:
for e in c.iterchildren("*"):
    if e.tag == 'div':
        e.getparent().remove(e)

print(etree.tostring(root))  # b'<span>text1</span>' (text2 and text3 removed! How to prevent this deletion?)

It looks like after I make changes to the lxml tree (deleting some tags), lxml also removes the unwrapped text. How can I prevent this and keep the unwrapped text?

Anzel · Accepted Answer

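The text that follows an element is called its tail. lxml stores it on that element, so removing the element removes its tail as well. You can preserve the tail by appending it to the parent's text before removing the node. Here is a sample: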

In [1]: from lxml import html

In [2]: s = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"

In [3]: tree = html.fromstring(s)

In [4]: for node in tree.iterchildren("div"):
   ...:     parent = node.getparent()
   ...:     if node.tail:
   ...:         # parent.text may be None, so fall back to an empty string
   ...:         parent.text = (parent.text or "") + node.tail
   ...:     parent.remove(node)
   ...:

In [5]: html.tostring(tree)
Out[5]: b'<span>text1text2text3</span>'

I used html rather than etree because the snippet looks more like HTML than XML. You can also pass "div" to iterchildren so that no separate tag check is needed.
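
Appending the tail to the parent's text works for the sample because every element in front of the kept text is also removed. If some siblings stay in the tree, the tail probably belongs after the previous sibling instead. The following is a minimal sketch of that idea, not part of the original answer: `remove_keep_tail` is a hypothetical helper name and the input string with the extra `<b>` element is made up for illustration. It also shows `etree.strip_elements` with `with_tail=False`, which should give the same result without a manual loop.

```python
from lxml import etree


def remove_keep_tail(node):
    """Remove `node` from its parent but keep the text that followed it.

    Hypothetical helper for illustration; it is not part of lxml.
    """
    parent = node.getparent()
    if node.tail:
        prev = node.getprevious()
        if prev is not None:
            # A sibling stays in front of the tail: attach the text to it.
            prev.tail = (prev.tail or "") + node.tail
        else:
            # No previous sibling: the text joins the parent's leading text.
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)


# Made-up input: the <b> element stays, both <div> elements are removed.
data = "<span>text1<div>ddd</div>text2<b>keep</b>text3<div>ddd</div>text4</span>"

root = etree.fromstring(data)
# Materialise the matches first so the tree is not modified while iterating.
for div in list(root.iterchildren("div")):
    remove_keep_tail(div)
result = etree.tostring(root)

# Alternative: let lxml drop the elements and leave their tail text in place.
root2 = etree.fromstring(data)
etree.strip_elements(root2, "div", with_tail=False)
result2 = etree.tostring(root2)

print(result)
print(result2)
```

The manual helper is useful when the removal depends on more than the tag name. Note that `strip_elements` also matches `div` elements deeper in the subtree, whereas the loop only touches direct children.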
