inverted_index

Reputation: 2427

Get the entire parent tag's text in ElementTree

While using Python's xml.etree.ElementTree package (imported as ET), I would like to get the entire text within an XML tag that contains some child nodes. Consider the following XML:

<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 
</p>

Assuming that the above XML is in node, node.text gives me only This is the start of parent tag.... However, I want to capture all of the text inside the p tag (including the text of its child tags), which would result in: This is the start of parent tag... child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2.

Is there a workaround for this? I looked through the documentation but couldn't find anything that works.

Upvotes: 1

Views: 1494

Answers (2)

Jack Fleeting

Reputation: 24930

You can do this with ElementTree using itertext(), which yields the text of an element and of all its descendants in document order:

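import xml.etree.ElementTree as ET
# The XML from the question, with its original line breaks and indentation
data = """<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2
</p>"""
root = ET.fromstring(data)  # fromstring returns the root Element
# Join the text of the element and all its descendants, in document order
print(' '.join(root.itertext()).strip())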

Output:

This is the start of parent tag...
         child 1 . blah1 blah1 blah1  child2  blah2 blah2 blah2

Upvotes: 2

Mathias Müller

Reputation: 22617

This is indeed a very awkward peculiarity of ElementTree. The gist is: if an element contains both text and child elements, only the text before the first child is stored as the element's text. Any text that follows a child element, up to the next sibling or the parent's closing tag, is stored as that child's tail rather than as part of the parent's text.

To collect all text that is an immediate child or descendant of an element, you therefore need the element's own text plus the text and tail of every descendant element.

>>> from lxml import etree

>>> s = '<p>This is the start of parent tag...<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>'

>>> root = etree.fromstring(s)
>>> child1, child2 = list(root)

>>> root.text
'This is the start of parent tag...'

>>> child1.text, child1.tail
('child 1', '. blah1 blah1 blah1 ')

>>> child2.text, child2.tail
('child2', ' blah2 blah2 blah2 ')

As for a complete solution, I found that this answer does something very similar, which you can easily adapt to your use case (by not printing the element names).
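For illustration, here is a minimal sketch of the manual approach described above: walk the element recursively and concatenate each child's text and tail. The helper name collect_text is made up for this example, and it uses the standard-library xml.etree.ElementTree rather than lxml; the text and tail attributes behave the same way for this input.

```python
import xml.etree.ElementTree as ET


def collect_text(element):
    """Return all text inside `element`, in document order (hypothetical helper)."""
    # Text before the first child (None if there is none)
    parts = [element.text or ""]
    for child in element:
        # Everything inside the child, collected recursively ...
        parts.append(collect_text(child))
        # ... followed by the text between this child and the next sibling
        parts.append(child.tail or "")
    return "".join(parts)


s = ('<p>This is the start of parent tag...'
     '<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 '
     '<ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>')
root = ET.fromstring(s)
full_text = collect_text(root)
print(full_text)
```

Note that the tail of the element passed in is deliberately not included, since that text sits outside the element.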


Edit: actually, the simplest solution by far, in my opinion, is to use itertext:

>>> ''.join(root.itertext())
'This is the start of parent tag...child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2 '
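If the XML is indented as in the question, the joined string still contains the line breaks and indentation from the source. One possible way to get a single-spaced result is to collapse whitespace runs after joining. This is only a sketch using the standard-library ElementTree; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The XML as written in the question, including the line break and indentation
xml_source = """<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2
</p>"""

root = ET.fromstring(xml_source)
raw = "".join(root.itertext())
# str.split() with no arguments splits on any run of whitespace and drops
# empty strings, so re-joining with a single space normalizes the text
normalized = " ".join(raw.split())
print(normalized)
```

This also collapses any whitespace that was meaningful in the original text, so it only fits cases where the layout of the source does not matter.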

Upvotes: 1
