Reputation: 55
I am new here and have been trying to write a small Python script to convert XML to CSV. Based on various posts here on Stack Overflow, I came up with sample code that works fine. However, the data I am working with has multiple nested levels, and I am unsure how to extract the data at the leaf level.
Here is what the data looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Transmission>
  <TransmissionBody>
    <level1>
      <level2>
        <level3>
          <level4>
            <level5>
              <level6>
                <ColA>ABC</ColA>
                <ColB>123</ColB>
              </level6>
            </level5>
          </level4>
        </level3>
      </level2>
    </level1>
  </TransmissionBody>
</Transmission>
I am using the code below to convert the XML to CSV:
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
final = {}
for elem in root:
    if len(elem):
        # Iterate the element directly; getchildren() was removed in Python 3.9.
        for c in elem:
            final[c.tag] = c.text
    else:
        final[elem.tag] = elem.text
df = pd.DataFrame([final])
df.to_csv('file.csv')
However, this code pulls only level2 and not ColA from level6.
Expected Output:
Transmission,TransmissionBody,level1,level2,level3,level4,level5,level6,ColA,ColB
,,,,,,,,ABC,123
,,,,,,,,DEF,456
Updated code:
import glob
import xml.etree.ElementTree as ET
import pandas as pd

def f(elem, result):
    result[elem.tag] = elem.text
    for c in elem:  # getchildren() was removed in Python 3.9
        result = f(c, result)
    return result

allFiles = glob.glob(folder)
for file in allFiles:
    xmllist = [file]
    for xmlfile in xmllist:
        tree = ET.parse(xmlfile)
        root = tree.getroot()
        d = f(root, {})
        df = pd.DataFrame(d, index=['values'])
Upvotes: 4
Views: 3485
Reputation: 9941
If I understood your question correctly, you need to traverse the XML tree, so you probably want to have a recursive function that does that. Something like the following:
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

def f(elem, result):
    # Record this element's text, then recurse into its children.
    result[elem.tag] = elem.text
    for c in elem:  # iterate directly; getchildren() was removed in Python 3.9
        result = f(c, result)
    return result

d = f(root, {})
df = pd.DataFrame(d, index=['values']).T
df
Out:
                 values
Transmission         \n
TransmissionBody     \n
level1               \n
level2               \n
level3               \n
level4               \n
level5               \n
level6               \n
ColA                ABC
ColB                123
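As an alternative sketch (not part of the original answer): `Element.iter()` visits an element and all of its descendants in document order, so the same flattening can be written without explicit recursion. The inline `SAMPLE_XML` string and the helper name `flatten` are assumptions made for illustration, and the text is stripped here so that whitespace-only nodes become empty strings instead of `\n`.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document so the sketch runs without file.xml
# (an assumption for illustration; in practice use ET.parse('file.xml')).
SAMPLE_XML = """<Transmission>
  <TransmissionBody>
    <level1><level2><level3><level4><level5>
      <level6>
        <ColA>ABC</ColA>
        <ColB>123</ColB>
      </level6>
    </level5></level4></level3></level2></level1>
  </TransmissionBody>
</Transmission>"""


def flatten(root):
    """Map each tag to its stripped text, visiting elements in document order."""
    row = {}
    for elem in root.iter():  # yields root itself, then every descendant
        row[elem.tag] = (elem.text or "").strip()
    return row


root = ET.fromstring(SAMPLE_XML)
row = flatten(root)
print(row)
```

Like the recursive version, this keeps only the last value seen for a tag, so it suits documents where each tag appears once.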
Update: Here's how to do it for multiple XML files. I've added another file similar to the original one, with the ColA and ColB rows replaced with
<ColA>DEF</ColA>
<ColB>456</ColB>
Here's the code:
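import glob
import pandas as pd
import xml.etree.ElementTree as ET

def f(elem, result):
    result[elem.tag] = elem.text
    for c in elem:  # getchildren() was removed in Python 3.9
        result = f(c, result)
    return result

results = []
for file in sorted(glob.glob('*.xml')):
    tree = ET.parse(file)
    root = tree.getroot()
    # Use a fresh dict per file so one file's values don't overwrite another's.
    results.append(f(root, {}))

df = pd.DataFrame(results).T  # one column per file: 0, 1, ...
df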
And the output:
                    0    1
Transmission       \n   \n
TransmissionBody   \n   \n
level1             \n   \n
level2             \n   \n
level3             \n   \n
level4             \n   \n
level5             \n   \n
level6             \n   \n
ColA              ABC  DEF
ColB              123  456
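To get the CSV layout from the question (one header row of tags, then one row per file), the per-file dictionaries can be written out as rows. This is a minimal sketch that swaps in the standard library's `csv.DictWriter` in place of pandas; the `TEMPLATE` string, the in-memory `documents` list, and the helper name `flatten` are assumptions standing in for real files on disk.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two in-memory documents standing in for the two XML files (hypothetical data
# built from the question's sample; replace with ET.parse(path) for real files).
TEMPLATE = (
    "<Transmission><TransmissionBody><level1><level2><level3><level4>"
    "<level5><level6><ColA>{a}</ColA><ColB>{b}</ColB></level6></level5>"
    "</level4></level3></level2></level1></TransmissionBody></Transmission>"
)
documents = [TEMPLATE.format(a="ABC", b="123"), TEMPLATE.format(a="DEF", b="456")]


def flatten(root):
    """Map each tag to its stripped text, in document order."""
    return {elem.tag: (elem.text or "").strip() for elem in root.iter()}


rows = [flatten(ET.fromstring(doc)) for doc in documents]

# Write a header taken from the first row, then one CSV line per document.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The sketch assumes every document has the same set of tags; `DictWriter` raises a `ValueError` if a later row contains a tag missing from the header. To write to disk, open the target with `open('file.csv', 'w', newline='')` and pass that file object instead of `buffer`.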
Upvotes: 1
Reputation: 29
You have an indentation issue on this line:
if len(elem):
I guess this should resolve it.
Upvotes: 2