jorbas

Reputation: 311

Why does this element in lxml include the tail?

Consider this Python script:

from lxml import etree

html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
  <body>
    <p>This is some text followed with 2 citations.<span class="footnote">1</span>
       <span class="footnote">2</span>This is some more text.</p>
  </body>
</html>'''

tree = etree.fromstring(html)

for element in tree.findall(".//{*}span"):
    if element.get("class") == 'footnote':
        print(etree.tostring(element, encoding="unicode", pretty_print=True))

The desired output is just the two span elements. Instead I get:

<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">1</span>
<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">2</span>This is some more text.

Why does it include the text after the element until the end of the parent element?

I'm trying to use lxml to link footnotes. When I a.insert() the span element into the a element I create for it, the text after the span comes along with it, so large amounts of text that I don't want linked end up inside the link.

Upvotes: 8

Views: 2010

Answers (2)

Lennart Regebro

Reputation: 172437

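It includes the text after the element because, in lxml's data model, that text belongs to the element: whatever follows an element's closing tag, up to the next tag, is stored as that element's tail.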

If you don't want that text to belong to the previous span, it needs to be contained in its own element. However, you can avoid printing this text when converting the element back to XML by passing with_tail=False to etree.tostring().

You can also simply set the element's tail to '' if you want to remove it from a specific element.

Upvotes: 2

falsetru

Reputation: 369474

Specifying with_tail=False leaves the tail text out of the serialized output:

print(etree.tostring(element, encoding="unicode", pretty_print=True, with_tail=False))

See the lxml.etree.tostring documentation.
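
For the linking use case in the question, with_tail=False alone won't help, because moving an element in lxml moves its tail along with it. One possible approach, sketched here under assumptions, is to hand the tail over to the new a element before moving the span into it. The helper name wrap_in_link and the href value "#fn2" are hypothetical, and the fragment is non-namespaced for brevity:

```python
from lxml import etree

def wrap_in_link(element, href):
    """Wrap element in a new <a>, keeping the element's tail outside the link.

    Hypothetical helper for illustration; assumes element has a parent.
    """
    parent = element.getparent()
    index = parent.index(element)

    link = etree.Element("a", href=href)
    # Transfer the trailing text to the wrapper so it stays in the paragraph.
    link.tail = element.tail
    element.tail = None

    link.append(element)         # moves the span out of parent and into the link
    parent.insert(index, link)   # put the link where the span used to be
    return link

p = etree.fromstring(
    '<p>Text.<span class="footnote">2</span>This is some more text.</p>'
)
wrap_in_link(p.find("span"), "#fn2")

result = etree.tostring(p, encoding="unicode")
print(result)
```

With a namespaced XHTML tree like the one in the question, the a element would presumably need to be created in the same namespace, e.g. with the tag "{http://www.w3.org/1999/xhtml}a".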

Upvotes: 5
