Loop through XML in Python

Question

My data set is as follows:

Currently I have two loops: one iterates through the child data, the other through the grandchild data.

import pandas
import xml.etree.ElementTree as element_tree
from xml.etree.ElementTree import parse

tree = element_tree.parse('')
root = tree.getroot()
name_space = {'ns0': 'http://SOMELINK'}

#root
date_from = root.attrib['date']
print(date_from)

#child
for pharma in root.findall('.//ns0:dept', name_space):
    for key, value in pharma.items():
        print(key +': ' + value)
    
# grandchild: this should be merged into the loop above, so the script walks an entire dept node before moving to the next
for owner in root.findall('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    owner_dict = {}
    
    for key, value in owner.items():
        print(key +': ' + value)

Current result is:

2021-01-15
dept_id: 00001
col_two: 00001value
col_three: 00001false
dept_id: 00002
col_two: 00002value
col_three: 00002value
col_four: 00001value
col_five: 00001value
col_six: 00001false
col_four: 00002value
col_five: 00002value
col_six: 00002false

I am aiming for a nested loop that first iterates over an entire dept child together with its grandchildren and only then moves on to the next one. The expected result is the set below, to be transformed into a pandas DataFrame later (I will try to work on this next). Some columns share a name between child and grandchild, so either a prefix is required or the loop must go through only specific children.

dept.dept_id: 00001
dept.col_two: 00001value
dept.col_three: 00001false
dept.name: some_name
currentowner.col_four: 00001value
currentowner.col_five: 00001value
currentowner.col_six: 00001false
currentowner.name: some_name

currentowner.col_four: 00001bvalue
currentowner.col_five: 00001bvalue
currentowner.col_six: 00001bfalse
currentowner.name: some_name

addr.col_seven: 00001value
addr.col_eight: 00001value
addr.col_nine: 00001false

dept.dept_id: 00002
dept.col_two: 00002value
dept.col_three: 00002value
dept.name: some_name
currentowner.col_four: 00002value
currentowner.col_five: 00002value
currentowner.col_six: 00002false
currentowner.name: some_name
addr.col_seven: 00002value
addr.col_eight: 00002value
addr.col_nine: 00002false

[UPDATE] - I came across zip, which should do the trick.

dept_list = []
for item in root.iterfind('.//ns0:dept', name_space):
    #print(item.attrib)
    dept_list.append(item.attrib)
#print(dept_list)


owner_list = []
for item in root.iterfind('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    #print(item.attrib)
    owner_list.append(item.attrib)
#print(owner_list)

zipped = zip(dept_list, owner_list)
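
One caveat worth checking before relying on zip: it pairs items purely by position and stops at the shorter list, so it only lines up when every dept has exactly one currentowner. The expected output above shows dept 00001 with two owners. The sketch below uses a small hypothetical XML sample (the element layout is an assumption inferred from the XPath expressions in the question) to compare zip with a nested loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: dept 00001 has two currentowner elements, dept 00002 has one.
xml = """<root xmlns:ns0="http://SOMELINK">
  <ns0:dept dept_id="00001">
    <ns0:owners>
      <ns0:currentowner col_four="00001value"/>
      <ns0:currentowner col_four="00001bvalue"/>
    </ns0:owners>
  </ns0:dept>
  <ns0:dept dept_id="00002">
    <ns0:owners>
      <ns0:currentowner col_four="00002value"/>
    </ns0:owners>
  </ns0:dept>
</root>"""

name_space = {"ns0": "http://SOMELINK"}
root = ET.fromstring(xml)

dept_list = [d.attrib for d in root.iterfind(".//ns0:dept", name_space)]
owner_list = [
    o.attrib
    for o in root.iterfind(".//ns0:dept/ns0:owners/ns0:currentowner", name_space)
]

# zip pairs by position: the second dept gets the second owner of the first dept.
zipped = [(d["dept_id"], o["col_four"]) for d, o in zip(dept_list, owner_list)]

# A nested loop searches for owners relative to each dept, so pairs stay aligned.
nested = [
    (d.get("dept_id"), o.get("col_four"))
    for d in root.iterfind(".//ns0:dept", name_space)
    for o in d.iterfind("ns0:owners/ns0:currentowner", name_space)
]

print(zipped)
print(nested)
```

With this sample, zip attaches an owner of dept 00001 to dept 00002 and drops the last owner, while the nested version keeps each owner with its own dept.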

Rob Raymond · Accepted Answer

The looping can be done in a list comprehension that builds one dict per dept by navigating the DOM. The following code goes straight to a data frame.

xml = """
  
    
      
        
      
    
  
  
    
      
        
      
    
   
"""

import xml.etree.ElementTree as ET
import pandas as pd

root = ET.fromstring(xml)

root.attrib
ns = {'ns0': 'http://SOMELINK'}
pd.DataFrame([{**d.attrib, 
  **d.find("ns0:owners/ns0:currentowner", ns).attrib, 
  **d.find("ns0:owners/ns0:currentowner/ns0:addr", ns).attrib}
 for d in root.findall("ns0:dept", ns)
])

Safer version

If any dept had no currentowner or no currentowner/addr, find would return None and accessing .attrib on it would fail. Walk the DOM considering these elements to be optional. The dict keys are now built from the element's tag name as well as the attribute name. Structure the comprehensions according to your data design: you need to consider 1 to 1, 1 to optional and 1 to many relationships. This really goes back to the papers that Codd wrote in 1970.
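
To make the failure mode concrete, here is a minimal sketch on a hypothetical XML sample (the sample and the names `rows` and `owner_attrs` are illustrative assumptions, not part of the original data). `Element.find` returns None when nothing matches, so the result has to be checked before reading `.attrib`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second dept has no currentowner element.
xml = """<root xmlns:ns0="http://SOMELINK" date="2021-01-15">
  <ns0:dept dept_id="00001">
    <ns0:owners>
      <ns0:currentowner col_four="00001value"/>
    </ns0:owners>
  </ns0:dept>
  <ns0:dept dept_id="00002"/>
</root>"""

ns = {"ns0": "http://SOMELINK"}
root = ET.fromstring(xml)

rows = []
for d in root.findall("ns0:dept", ns):
    co = d.find("ns0:owners/ns0:currentowner", ns)  # None when there is no match
    # Compare against None explicitly: an Element without children is falsy.
    owner_attrs = dict(co.attrib) if co is not None else {}
    rows.append((d.get("dept_id"), owner_attrs))

print(rows)
```

Calling `d.find(...).attrib` directly on the second dept would raise AttributeError, because the lookup returns None there.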

import xml.etree.ElementTree as ET
import pandas as pd

root = ET.fromstring(xml)
ns = {'ns0': 'http://SOMELINK'}
pd.DataFrame([{**{f"{d.tag.split('}')[1]}.{k}":v for k,v in d.items()}, 
  **{f"{co.tag.split('}')[1]}.{k}":v  for k,v in co.items()}, 
  **{f"{addr.tag.split('}')[1]}.{k}":v for addr in co.findall("ns0:addr", ns) for k,v in addr.items()} }
 for d in root.findall("ns0:dept", ns)
 for co in d.findall("ns0:owners/ns0:currentowner", ns)
])
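
Note that in the comprehension above a dept with no currentowner produces no row at all, because the inner loop has nothing to iterate over. If such depts should be kept, the same idea can be written as explicit loops. This is only a sketch under assumptions: the XML sample is hypothetical (shaped after the expected output in the question) and `prefixed` is a helper name made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, shaped after the expected output in the question.
xml = """<root xmlns:ns0="http://SOMELINK" date="2021-01-15">
  <ns0:dept dept_id="00001" col_two="00001value">
    <ns0:owners>
      <ns0:currentowner col_four="00001value">
        <ns0:addr col_seven="00001value"/>
      </ns0:currentowner>
      <ns0:currentowner col_four="00001bvalue"/>
    </ns0:owners>
  </ns0:dept>
  <ns0:dept dept_id="00002" col_two="00002value"/>
</root>"""

ns = {"ns0": "http://SOMELINK"}


def prefixed(elem):
    """Return the element's attributes keyed as '<local tag>.<attribute>'."""
    local = elem.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' part, if any
    return {f"{local}.{k}": v for k, v in elem.items()}


root = ET.fromstring(xml)
rows = []
for dept in root.findall("ns0:dept", ns):
    owners = dept.findall("ns0:owners/ns0:currentowner", ns)
    if not owners:
        rows.append(prefixed(dept))  # keep depts that have no owner
        continue
    for owner in owners:  # one row per (dept, currentowner) pair
        row = {**prefixed(dept), **prefixed(owner)}
        addr = owner.find("ns0:addr", ns)  # addr is treated as optional
        if addr is not None:
            row.update(prefixed(addr))
        rows.append(row)

for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`; keys missing from a row become NaN in the frame.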
