Reputation: 2427
While using the xml.etree.ElementTree Python package (imported as ET), I would like to get the entire text within an XML tag that contains some child nodes. Consider the following XML:
<p>This is the start of parent tag...
<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2
</p>
Assuming that the above XML is in node, node.text would just give me "This is the start of parent tag...". However, I want to capture all of the text inside the p tag (along with the text of its child tags), which would result in: "This is the start of parent tag... child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2".
Is there a workaround for this? I looked through the documentation but couldn't find anything that works.
Upvotes: 1
Views: 1494
Reputation: 24930
You can do something similar with ElementTree:
import xml.etree.ElementTree as ET
data = """[your string above]"""
tree = ET.fromstring(data)
print(' '.join(tree.itertext()).strip())
Output:
This is the start of parent tag...
child 1 . blah1 blah1 blah1 child2 blah2 blah2 blah2
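Joining with a space, as above, also puts a space before the period and doubles the spaces around the child elements. A minimal sketch, assuming that collapsing every run of whitespace (including the line breaks in the question's XML) into a single space is acceptable; the helper name all_text is made up for illustration:

```python
import xml.etree.ElementTree as ET

# The XML from the question, including its line breaks.
data = """<p>This is the start of parent tag...
<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2
</p>"""


def all_text(elem):
    """Return all text inside elem with whitespace runs collapsed (hypothetical helper)."""
    # Join the fragments without a separator so the original spacing is kept,
    # then let split()/join() normalise newlines and repeated spaces.
    return " ".join("".join(elem.itertext()).split())


root = ET.fromstring(data)
print(all_text(root))
```

This should give the single-line string asked for in the question, without the stray space before the period.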
Upvotes: 2
Reputation: 22617
This is indeed a very awkward peculiarity of ElementTree. The gist is: if an element contains both text and child elements, only the text before the first child is stored as the element's text. Any text that follows a child element is stored as that child's tail instead of as part of the parent's text.
In order to collect all text that is an immediate child or descendant of an element, you need to access the text of this element, plus the text and tail of all its descendant elements.
>>> from lxml import etree
>>> s = '<p>This is the start of parent tag...<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>'
>>> root = etree.fromstring(s)
>>> child1, child2 = list(root)  # getchildren() is deprecated
>>> root.text
'This is the start of parent tag...'
>>> child1.text, child1.tail
('child 1', '. blah1 blah1 blah1 ')
>>> child2.text, child2.tail
('child2', ' blah2 blah2 blah2 ')
As for a complete solution, this answer does something very similar, and you can easily adapt it to your use case (by not printing the names of the elements).
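The text/tail walk described above might look roughly like the following. This is not the linked answer's code, only a minimal sketch using the standard-library ElementTree; the function name collect_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def collect_text(elem):
    """Concatenate elem.text with the text and tail of every descendant (hypothetical helper)."""
    parts = [elem.text or ""]  # text before the first child, if any
    for child in elem:
        parts.append(collect_text(child))  # the child's own text, recursively
        parts.append(child.tail or "")     # text that follows the child inside elem
    return "".join(parts)


s = ('<p>This is the start of parent tag...<ref type="chlid1">child 1</ref>'
     '. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>')
root = ET.fromstring(s)
print(repr(collect_text(root)))
```

The element's own tail is deliberately left out, since that text belongs to its parent rather than to the element itself.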
Edit: actually, the simplest solution by far, in my opinion, is to use itertext:
>>> ''.join(root.itertext())
'This is the start of parent tag...child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2 '
Upvotes: 1