Reputation: 402
I'm new to XML data processing. I want to extract the text data from the following XML file:
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
so that the expected result is:
['12345','45667', 'abcde']
Currently I have tried:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
data = tree.getiterator()
text = [data[i].text for i in range(0, len(data))]
But the result only shows ['12345','45667']; 'abcde' is missing. Can someone help me? Thanks in advance!
Upvotes: 3
Views: 3496
Reputation: 184955
Try this using XPath and lxml:
import lxml.etree as etree
string = '''
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
'''
tree = etree.fromstring(string)
print(tree.xpath('//p//text()'))
The XPath expression means: "select every text node that is a descendant of any p element, at any depth".
Output:
['12345', '45667', 'abcde']
Upvotes: 2
Reputation: 473753
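getiterator() (or its replacement, iter()) iterates over elements, not text nodes. An element's text attribute holds only the text before its first child, while abcde is a text node that follows the closing strong tag, so it is stored as the tail of the strong element.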
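To see where each piece of text ends up, here is a minimal sketch that prints text and tail for every element. It assumes the question's XML is pasted inline as xml_string (a name made up for this example) instead of being read from a file:

```python
import xml.etree.ElementTree as ET

# Inline copy of the XML from the question (assumption: same content as the file).
xml_string = "<data>\n<p>12345<strong>45667</strong>abcde</p>\n</data>"
root = ET.fromstring(xml_string)

# Collect (tag, text, tail) for the root and every descendant element.
parts = [(elem.tag, elem.text, elem.tail) for elem in root.iter()]

for tag, text, tail in parts:
    print(tag, repr(text), repr(tail))
```

The strong element should report '45667' as its text and 'abcde' as its tail, which is why reading only text misses it.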
You can use the itertext() method:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
# itertext() yields the text and tail strings inside <p>, in document order
print(list(tree.find('p').itertext()))
Prints:
['12345', '45667', 'abcde']
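If you call itertext() on the root instead of on p, the newlines between tags come back as well. A rough sketch of filtering them out, assuming the question's XML is pasted inline as xml_string (a name used only for this example):

```python
import xml.etree.ElementTree as ET

# Inline copy of the XML from the question (assumption: same content as the file).
xml_string = "<data>\n<p>12345<strong>45667</strong>abcde</p>\n</data>"
root = ET.fromstring(xml_string)

# itertext() on the root also yields the whitespace between tags.
raw = list(root.itertext())

# Keep only the strings that contain something other than whitespace.
cleaned = [t.strip() for t in raw if t.strip()]
print(cleaned)
```

This avoids hard-coding the p tag, at the cost of also stripping leading and trailing whitespace from each string, which may or may not be what you want.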
Upvotes: 2