Create a pandas DataFrame from a nested XML file

Question

Here is a small section of an XML file. I would like to create a database table from it in which each tag becomes a uniquely named column and no data is duplicated.

I tried using lxml, and the best I have managed so far is a DataFrame that looks something like this:

"    
SRCSGT
DATE    11112017
AGENCY  Department of Veterans Affairs
OFFICE  Canandaigua VAMC   
LOCATION    Department of Veterans Affairs Medical Center
ZIP 14424
etc, etc, "

The XML



  
    11112017
    
    
    
    14424
    H
    238210
    
    
    
    11172017
    12172017
    CONTRACT SPECIALIST]]>
    
    
    
      
      
    
    N/A
    N

The code I wrote

from lxml import etree as et
import pandas as pd

trees = et.parse('test.xml') #get xml file
root = trees.getroot() #get to root of file

tags = [] #list for holding all tags
datas = [] #list for holding all data in tags


for child in root: #iterating over root yields only its direct child elements
    #print(child.tag)
    tt = child.tag #reads xml tag
    tags.append(tt)
    datas.append(child.text) #read xml tag data
    for c in child.findall('./'): # ./ finds children
        tt1 = c.tag
        tags.append(str(tt1))
        datas.append(c.text)
        for i in c.findall('./'): #each child node loads a new list of elements
            tt2 = i.tag
            tags.append(str(tt1)+ '_' + str(tt2))
            datas.append(i.text)
            for j in i.findall('./'):
                tt3 = j.tag
                tags.append(str(tt1)+ '_' + str(tt2) + '_' + str(tt3))
                datas.append(j.text)
                for k in j.findall('./'):
                    tt4 = k.tag
                    tags.append(str(tt1)+ '_' + str(tt2) + '_' + str(tt3) + '_' + str(tt4))
                    datas.append(k.text)

df = pd.DataFrame({"tags": tags,"values": datas})

The desired result is something like this:

date    agency  office
1/1/10  A1      O1
1/1/10  A2      O2
1/1/10  A3      O3

So basically the tags should become the column headers, and each column should be populated with the matching values. Column names should not be repeated, so that I can create a standard database table.
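
One way to get from the nested file to the layout described above is to flatten each record into a single dictionary and let the dictionary keys become the column names. The following is only a minimal sketch under several assumptions: the sample XML is invented for illustration (the wrapper tag `NOTICES` and the nested `CONTACT`/`EMAIL` tags are hypothetical, while `SRCSGT`, `DATE`, `AGENCY` and `OFFICE` are taken from the output shown in the question), each direct child of the root is treated as one record, and the standard-library `xml.etree.ElementTree` is used in place of lxml so the example runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Invented sample data; NOTICES, CONTACT and EMAIL are hypothetical tag names.
SAMPLE = """<NOTICES>
  <SRCSGT>
    <DATE>1/1/10</DATE>
    <AGENCY>A1</AGENCY>
    <OFFICE>O1</OFFICE>
    <CONTACT>
      <EMAIL>someone@example.com</EMAIL>
    </CONTACT>
  </SRCSGT>
  <SRCSGT>
    <DATE>1/1/10</DATE>
    <AGENCY>A2</AGENCY>
    <OFFICE>O2</OFFICE>
  </SRCSGT>
</NOTICES>"""


def flatten(elem, prefix=""):
    """Return {column_name: text} for every leaf element below elem.

    Nested tags are joined with '_' so each leaf gets its own column name.
    Repeated sibling tags with the same name would overwrite each other;
    that case is not handled in this sketch.
    """
    row = {}
    for child in elem:
        name = f"{prefix}_{child.tag}" if prefix else child.tag
        if len(child):  # the element has children: recurse with a longer prefix
            row.update(flatten(child, name))
        else:  # leaf element: keep its text as the cell value
            row[name] = (child.text or "").strip()
    return row


root = ET.fromstring(SAMPLE)
# Assumption: each direct child of the root is one record (one table row).
rows = [flatten(record) for record in root]
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per record and one column per distinct tag path, with fields missing from a record left as NaN. Because the recursion replaces the fixed stack of nested loops, the depth of the XML does not need to be known in advance.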
