Reputation: 35
<root>
<article>
<front>
<body>
<back>
<sec id="sec7" sec-type="funding">
<title>Funding</title>
<p>This work was supported by the NIH</p>
</sec>
</back>
I have an XML file of scientific journal metadata and am trying to extract just the funding information for each article. I need the text contained within the p tag. While the "sec id" varies between articles, the "sec-type" is always "funding".
I have been trying to do this in Python 3 using ElementTree.
import xml.etree.ElementTree as ET
tree = ET.parse("journals.xml")  # the filename must be passed as a string
root = tree.getroot()
for title in root.iter("title"):
    ET.dump(title)
Any help would be greatly appreciated!
Upvotes: 0
Views: 84
Reputation: 11157
You can use findall with an XPath expression to extract the values you want. I extrapolated from your example data a little bit in order to complete the document and have two p elements:
<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
          <sec id="sec8" sec-type="funding">
            <title>Funding</title>
            <p>I'm a little teapot</p>
          </sec>
        </back>
      </body>
    </front>
  </article>
</root>
The following extracts the text content of every p node directly under a sec node where sec-type="funding":
import xml.etree.ElementTree as ET
doc = ET.parse('journals.xml')
print([p.text for p in doc.findall('.//sec[@sec-type="funding"]/p')])
Result:
['This work was supported by the NIH', "I'm a little teapot"]
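Since you want the funding text for each article, here is a rough sketch that groups the paragraphs per article element instead of returning one flat list. The sample document, the bold inline tag, and the funding_by_article helper are my own assumptions for illustration, not taken from your file. It uses itertext() rather than .text, because .text only returns the text before the first child element and would cut the paragraph short if it contains inline markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two articles, one paragraph with inline markup.
SAMPLE = """
<root>
  <article>
    <back>
      <sec id="sec7" sec-type="funding">
        <title>Funding</title>
        <p>This work was supported by the <bold>NIH</bold>.</p>
      </sec>
    </back>
  </article>
  <article>
    <back>
      <sec id="sec3" sec-type="funding">
        <title>Funding</title>
        <p>I'm a little teapot</p>
      </sec>
    </back>
  </article>
</root>
""".strip()


def funding_by_article(root):
    """Return one list of funding paragraphs per <article> element."""
    results = []
    for article in root.iter("article"):
        paragraphs = [
            # itertext() also collects text nested inside child elements
            "".join(p.itertext()).strip()
            for p in article.findall('.//sec[@sec-type="funding"]/p')
        ]
        results.append(paragraphs)
    return results


root = ET.fromstring(SAMPLE)
funding = funding_by_article(root)
print(funding)
```

For a file on disk, ET.parse('journals.xml').getroot() could be passed to the helper in place of the parsed string. An article without a funding section simply yields an empty list.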
Upvotes: 2