Extracting nested XML elements of different sizes into Pandas

Question

Let's assume we have an arbitrary XML document like the one below:



   
      Organization 1
      academic bachelor
      academic master
      Here is some text; blablabla
      Scrum master
   
   
      bachelor
      academic master
      academic bachelor
      Organization 2
      Text from another organization about some stuff.
      Excutives
   
   
      Organization 3
      Also another huge text description from another organization.
      Negotiating
      Effective leadership
      negotiating techniques
      leadership
      strategic planning

Currently I'm looping over the elements I need by using their absolute paths, since I'm not able to use any of the get or find methods in ElementTree. As such, my code looks like below:

import pandas as pd
import xml.etree.ElementTree as ET   
import numpy as np
import itertools

tree = ET.parse('data.xml')
root = tree.getroot()
root.tag

dfcols=['organization','description','level','keyword']
organization=[]
description=[]
level=[]
keyword=[]

for node in root:
    for child in node.findall('.//{http://something.org/schema/s/program}orgUnitId'):
        organization.append(child.text)
    for child in node.findall('.//{http://something.org/schema/s/program}programDescriptionText'):
        description.append(child.text) 
    for child in node.findall('.//{http://something.org/schema/s/program}requiredLevel'):
        level.append(child.text)
    for child in node.findall('.//{http://something.org/schema/s/program}searchword'):
        keyword.append(child.text)

The goal, of course, is to create one DataFrame. However, since each node in the XML file contains one or more of some elements, such as requiredLevel or searchword, I'm currently losing data when I cast the lists to a DataFrame, either with:

df=pd.DataFrame(list(itertools.zip_longest(organization,
    description,level,keyword,
    fillvalue=np.nan)),columns=dfcols)

or by using pd.Series as given here, or by another solution from here that I can't seem to adapt to my case.

My best bet is probably not to use lists at all, since they don't seem to index the data correctly. That is, I lose data from the 2nd to the Xth child node. Right now I'm stuck and don't see any other options.
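To illustrate the misalignment, here is a minimal sketch using two of the flat lists, filled by hand with values from the sample data above. The hard-coded lists stand in for what the loops collect, and `rows` is only an illustrative name.

```python
from itertools import zip_longest

# Hand-written stand-ins for the lists built by the loops above.
organization = ["Organization 1", "Organization 2", "Organization 3"]
level = [
    "academic bachelor", "academic master",              # from Organization 1
    "bachelor", "academic master", "academic bachelor",  # from Organization 2
]

# zip_longest pairs items purely by list position, so the link between
# a level and the node it came from is gone.
rows = list(zip_longest(organization, level, fillvalue=None))
for row in rows:
    print(row)
```

Each level ends up next to whichever organization shares its list position, not next to the organization it was read from.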

What my end result should look like is this:

organization    description  level                keyword
Organization 1  ....         academic bachelor,   Scrum master
                             academic master 
Organization 2  ....         bachelor,            Executives
                             academic master, 
                             academic bachelor    
Organization 3  ....                              Negotiating,
                                                  Effective leadership,
                                                  negotiating techniques,
                                                  ....

Parfait · Accepted Answer

Consider building a list of dictionaries with comma-collapsed text values, one dictionary per node. Then pass the list to the pandas.DataFrame constructor:

dicts = []
for node in root:
    orgs = ", ".join([org.text for org in node.findall('.//{http://something.org/schema/s/program}orgUnitId')])
    desc = ", ".join([desc.text for desc in node.findall('.//{http://something.org/schema/s/program}programDescriptionText')])
    lvls = ", ".join([lvl.text for lvl in node.findall('.//{http://something.org/schema/s/program}requiredLevel')])
    wrds = ", ".join([wrd.text for wrd in node.findall('.//{http://something.org/schema/s/program}searchword')])

    dicts.append({'organization': orgs, 'description': desc, 'level': lvls, 'keyword': wrds})

final_df = pd.DataFrame(dicts, columns=['organization','description','level','keyword'])

Output

print(final_df)
#      organization                                        description                                         level                                            keyword
# 0  Organization 1                       Here is some text; blablabla            academic bachelor, academic master                                       Scrum master
# 1  Organization 2   Text from another organization about some stuff.  bachelor, academic master, academic bachelor                                          Excutives
# 2  Organization 3  Also another huge text description from anothe...                                                Negotiating, Effective leadership, negotiating...
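The repeated namespace URI can also be factored out with a prefix map, which `findall` accepts as its second argument. The following is a minimal, self-contained sketch of the same approach, not a drop-in replacement. The wrapper element names `programs` and `program` in the sample string are assumptions, since the full XML structure is not shown in the question, and `NS`, `joined_text`, and `demo_df` are illustrative names. Only the four leaf element names and the namespace URI come from the question.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample document: the wrapper tags (programs, program) are
# assumed; the leaf tags and the namespace URI are taken from the question.
XML = """<programs xmlns="http://something.org/schema/s/program">
  <program>
    <orgUnitId>Organization 1</orgUnitId>
    <requiredLevel>academic bachelor</requiredLevel>
    <requiredLevel>academic master</requiredLevel>
    <programDescriptionText>Here is some text; blablabla</programDescriptionText>
    <searchword>Scrum master</searchword>
  </program>
  <program>
    <orgUnitId>Organization 3</orgUnitId>
    <programDescriptionText>Also another huge text description from another organization.</programDescriptionText>
    <searchword>Negotiating</searchword>
    <searchword>Effective leadership</searchword>
  </program>
</programs>"""

# Prefix map so paths can use "p:tag" instead of the full "{uri}tag" form.
NS = {"p": "http://something.org/schema/s/program"}


def joined_text(node, tag):
    """Comma-join the text of every descendant <tag> of node, skipping empty elements."""
    return ", ".join(el.text for el in node.findall(f".//p:{tag}", NS) if el.text)


root = ET.fromstring(XML)

# One dictionary per top-level node keeps every value tied to its own node.
rows = [
    {
        "organization": joined_text(node, "orgUnitId"),
        "description": joined_text(node, "programDescriptionText"),
        "level": joined_text(node, "requiredLevel"),
        "keyword": joined_text(node, "searchword"),
    }
    for node in root
]

demo_df = pd.DataFrame(rows, columns=["organization", "description", "level", "keyword"])
```

A node without any requiredLevel element simply gets an empty string in the level column, so no padding with NaN is needed to keep the rows aligned.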
