Slowat_Kela

Reputation: 1511

Iterate through all sub-tags and strings from an XML tag in python, without specifying sub-tag name

My question is a follow-up to the one here, but I'm not meant to use the answer section for follow-up questions.

If I have part of an XML file like this:

  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:

          -  women undergoing cesarean section for any indication

          -  literate in german language

        Exclusion Criteria:

          -  history of keloids

          -  previous transversal suprapubic scars

          -  known patient hypersensitivity to any of the suture materials used in the protocol

          -  a medical disorder that could affect wound healing (eg, diabetes mellitus, chronic
             corticosteroid use)
      </textblock>
    </criteria>
    <gender>Female</gender>
    <minimum_age>18 Years</minimum_age>
    <maximum_age>45 Years</maximum_age>
    <healthy_volunteers>No</healthy_volunteers>
  </eligibility>

I want to pull out all of the strings in this eligibility section (i.e. the string in the textblock section and the gender, minimum age, maximum age and healthy volunteers sections).

Using the code from the linked question, I did this:

import sys
from bs4 import BeautifulSoup

# Parse the file given on the command line with the lxml parser
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

eligibi = []

for eligibility in soup.find_all('eligibility'):
    # The sub-tag names are hard-coded here
    d = {'other_name': eligibility.criteria.textblock.string, 'gender': eligibility.gender.string}
    eligibi.append(d)

print(eligibi)

My problem is that I have many files, and the structure of the XML can vary between them:

eligibility -> criteria -> textblock -> text
eligibility -> other things (e.g. gender as above) -> text
eligibility -> text

Is there a way to just 'take all of the sub-headings and their texts'?

So in the above example, the list/dictionary would contain: {criteria textblock: inclusion and exclusion criteria, gender: xxx, minimum_age: xxx, maximum_age: xxx, healthy_volunteers: xxx}

In reality, I am not going to know all the specific sub-tags of the eligibility tag, as each experiment could be different (e.g. some might say 'pregnant women accepted', 'drug history of XXX accepted', etc.).

So I just want something that, given a tag name, returns all the sub-tags and the text of those sub-tags in a dictionary.

Extended XML for comment:

<brief_title>Subcutaneous Adaption and Cosmetic Outcome Following Caesarean Delivery</brief_title>
<source>Klinikum Klagenfurt am Wörthersee</source>

...and then the eligibility XML section above.

Upvotes: 0

Views: 1266

Answers (1)

har07

Reputation: 89295

Since you have lxml installed, you can try the following (this code assumes that leaf element names within a given element, i.e. eligibility, are unique; a repeated tag name would overwrite the earlier entry in the dictionary):

import sys
from lxml import etree

tree = etree.parse(sys.argv[1])
root = tree.getroot()

eligibi = []

for eligibility in root.xpath('//eligibility'):
    d = {}
    # Select every descendant that has no child elements (leaf elements)
    for e in eligibility.xpath('.//*[not(*)]'):
        d[e.tag] = e.text
    eligibi.append(d)

print(eligibi)
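To see what the loop collects, here is a small self-contained run of the same XPath on a trimmed copy of the XML from the question. The wrapping `study` element and the shortened textblock text are assumptions made only to keep the sample short; real files will differ.

```python
from lxml import etree

# Trimmed sample based on the question's XML; <study> is a made-up wrapper.
xml = """<study>
  <eligibility>
    <criteria>
      <textblock>Inclusion Criteria:</textblock>
    </criteria>
    <gender>Female</gender>
    <minimum_age>18 Years</minimum_age>
    <maximum_age>45 Years</maximum_age>
    <healthy_volunteers>No</healthy_volunteers>
  </eligibility>
</study>"""

root = etree.fromstring(xml)

eligibi = []
for eligibility in root.xpath('//eligibility'):
    d = {}
    # Leaf elements only: <criteria> has a child, so it is skipped
    for e in eligibility.xpath('.//*[not(*)]'):
        d[e.tag] = e.text
    eligibi.append(d)

print(eligibi)
```

Note that `criteria` does not appear as a key, because it contains a child element; its text ends up under `textblock` instead.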

XPath explanation:

  • .//* : find all elements within the current eligibility element, at any depth (//) and with any tag name (*)
  • [not(*)] : keep only those elements that have no child elements, i.e. leaf elements
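The same leaf test can also be written without XPath, since an element with no children has `len(element) == 0`. Below is a rough sketch using only the standard library, which additionally keys each leaf by its path (e.g. `criteria/textblock`) and covers the case from the question where eligibility holds text directly. The helper name `leaf_texts`, the "/"-joined keys, the `study` wrapper and the second `eligibility` element are all made up for illustration; treat it as one possible variant rather than part of the answer above.

```python
import xml.etree.ElementTree as ET


def leaf_texts(element, prefix=None):
    """Return {path: text} for every leaf under `element` (hypothetical helper).

    Paths are relative to the starting element; if the starting element is
    itself a leaf, its own tag is used as the key.
    """
    if len(element) == 0:  # no child elements: same idea as XPath [not(*)]
        return {prefix or element.tag: (element.text or "").strip()}
    result = {}
    for child in element:
        child_key = child.tag if prefix is None else prefix + "/" + child.tag
        result.update(leaf_texts(child, child_key))
    return result


# Made-up sample: one nested eligibility block and one that holds only text.
xml = """<study>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:
      </textblock>
    </criteria>
    <gender>Female</gender>
  </eligibility>
  <eligibility>Adults only</eligibility>
</study>"""

root = ET.fromstring(xml)
eligibi = [leaf_texts(e) for e in root.iter("eligibility")]
print(eligibi)
```

As with the lxml version, leaves that share the same path still overwrite each other, and any text that sits alongside child elements (mixed content) is ignored.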

Upvotes: 1
