Lil_Joe

Reputation: 35

Finding specific XML attribute of child element using Python?

<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
        </back>

I have an XML file of scientific journal metadata and am trying to extract just the funding information for each article. I need the text contained within the p tag. While the "sec id" varies between articles, the "sec-type" is always "funding".

I have been trying to do this in Python 3 using ElementTree.

import xml.etree.ElementTree as ET  

tree = ET.parse("journals.xml")
root = tree.getroot()
for title in root.iter("title"):
    ET.dump(title)

Any help would be greatly appreciated!

Upvotes: 0

Views: 84

Answers (1)

cody

Reputation: 11157

You can use findall with an XPath expression to extract the values you want. I extrapolated from your example data a little bit in order to complete the document and have two p elements:

<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
          <sec id="sec8" sec-type="funding">
            <title>Funding</title>
            <p>I'm a little teapot</p>
          </sec>
        </back>
      </body>
    </front>
  </article>
</root>

The following extracts the text content of every p node that sits directly under a sec node where sec-type="funding":

import xml.etree.ElementTree as ET

doc = ET.parse('journals.xml')
print([p.text for p in doc.findall('.//sec[@sec-type="funding"]/p')])

Result:

['This work was supported by the NIH', "I'm a little teapot"]
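If you need the results grouped per article rather than as one flat list, the same XPath can be run relative to each article element. The following is only a sketch under a few assumptions: the sample document is inlined as a string so it runs without journals.xml on disk, the function name funding_by_article is made up for illustration, and itertext() is used in place of .text on the guess that real p elements may contain inline markup.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the completed document above, inlined so the
# sketch runs without a journals.xml file (an assumption for illustration).
SAMPLE = """<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
          <sec id="sec8" sec-type="funding">
            <title>Funding</title>
            <p>I'm a little teapot</p>
          </sec>
        </back>
      </body>
    </front>
  </article>
</root>"""


def funding_by_article(root):
    """Return one list of funding paragraphs per <article> element.

    Hypothetical helper, not part of ElementTree.
    """
    results = []
    for article in root.iter("article"):
        # The leading "." makes the search relative to this article only.
        paragraphs = [
            # itertext() also picks up text nested inside child tags of <p>.
            "".join(p.itertext()).strip()
            for p in article.findall('.//sec[@sec-type="funding"]/p')
        ]
        results.append(paragraphs)
    return results


root = ET.fromstring(SAMPLE)
funding = funding_by_article(root)
print(funding)
```

With a real file, ET.parse('journals.xml').getroot() would take the place of ET.fromstring(SAMPLE). Articles with no funding section simply produce an empty list.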

Upvotes: 2
