Parsing lxml.etree._Element contents

Question

I have the following element that I parsed out of a

I am trying to extract "55488 Power La Vaca (8025K) Linux 4.2.x.x" from this element (including the spaces).

import lxml.etree as ET
td_html = """


"""

td_elem = ET.fromstring(td_html)

fail_1 = td_elem.find('a').text + td_elem.text
print "FAIL_1", fail_1

print "FAIL_2"
for elem in td_elem.iterchildren():
    print elem.tag, elem.text

Results

$ python textxml.py

FAIL_1
    5548U


FAIL_2
a
    5548U

br None
br None
br None
br None
$

Question

It is humbling that I have to ask this question, since it doesn't seem like it should be hard.

How can I extract "Power La Vaca (8025K) Linux 4.2.x.x" from the td_elem element (including the spaces)?

Please, no regexp solutions.

Solution

The explicit solution (using Finn's suggestion of itertext()):

import lxml.etree as ET
td_html = """


"""

td_elem = ET.fromstring(td_html)
print "SUCCESS", ' '.join([txt.strip() for txt in td_elem.itertext()])

5548U
Power La Vaca
(M8025K)
Linux 4.2.x.x
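
One caveat with the one-liner above: stripping each chunk leaves empty strings wherever the markup contains only whitespace, so the joined result can pick up doubled or leading spaces. Here is a minimal sketch that filters those out. The markup in TD_HTML is hypothetical (the original snippet is not shown here), and joined_text is a helper name made up for this example. The import falls back to the standard library, which offers the same itertext() method, in case lxml is not installed.

```python
try:
    import lxml.etree as ET
except ImportError:  # the standard library provides the same itertext() method
    import xml.etree.ElementTree as ET

# Hypothetical markup, assumed for illustration only.
TD_HTML = """
<td>
    <a href="#">
    5548U
    </a>
    <br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""


def joined_text(elem):
    """Join every non-blank text chunk under elem with single spaces."""
    chunks = (txt.strip() for txt in elem.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


td_elem = ET.fromstring(TD_HTML.strip())
print(joined_text(td_elem))
```

With this assumed markup the link text comes first, followed by the three pieces of text that sit after the br elements.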

Finn · Accepted Answer

I know there must be a better way, but this works.

link = td_elem.find('a').text.strip()
text = ''.join(td_elem.itertext()).strip()
text.split(link)[1]

Output is Power La Vaca(M8025K)Linux 4.2.x.x

Update: This is actually better if you want spaces in place of those <br> tags:

' '.join(map(str, [el.tail for el in td_elem.iterchildren() if el.tail]))

The map(str, ...) call isn't actually needed here, since each tail is already a string, but I can imagine other values for which it would be.
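
For anyone wondering why the tails are the right place to look: in the ElementTree model that lxml follows, an element's .text only covers the text before its first child, and whatever comes after a child's closing tag is stored on that child as .tail. A rough sketch of the tail-based approach is below. The markup is hypothetical (the original snippet is not shown here), tail_text is a made-up helper name, and the import falls back to the standard library, which shares the .text/.tail attributes. Unlike the one-liner above, it strips each tail, so whitespace-only tails (such as the newline after the link) are dropped.

```python
try:
    import lxml.etree as ET
except ImportError:  # the standard library uses the same .text/.tail model
    import xml.etree.ElementTree as ET

# Hypothetical markup, assumed for illustration only.
TD_HTML = """
<td>
    <a href="#">
    5548U
    </a>
    <br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""


def tail_text(elem):
    """Join the text that follows each direct child of elem (its .tail)."""
    tails = ((child.tail or "").strip() for child in elem)
    return " ".join(tail for tail in tails if tail)


td_elem = ET.fromstring(TD_HTML.strip())

# td_elem.text holds only the whitespace before the <a> child.
print(repr(td_elem.text))
# The text after each <br/> lives on that <br/> element as its tail.
print(tail_text(td_elem))
```

Because only tails are collected, the link text itself is left out, which matches what the question asks for.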
