coding learner

Reputation: 55

Python - Converting xml to csv using Python pandas

I am new here and have been trying to write a small Python script to convert XML to CSV. Based on various posts here on Stack Overflow, I have managed to come up with sample code that works fine for flat data. However, the data I am working with is nested several levels deep, and I am unsure how to extract the values at the leaf level.

Here is what the data looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Transmission>
    <TransmissionBody>
        <level1>
            <level2>
                <level3>
                    <level4>
                        <level5>
                            <level6>
                                <ColA>ABC</ColA>
                                <ColB>123</ColB>
                            </level6>
                        </level5>
                    </level4>
                </level3>
            </level2>
        </level1>
    </TransmissionBody>
</Transmission>

I am using the code below to try to convert the XML to CSV:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
final = {}
for elem in root:
    if len(elem):
        for c in elem:  # Element.getchildren() was removed in Python 3.9
            final[c.tag] = c.text
    else:
        final[elem.tag] = elem.text

df = pd.DataFrame([final])
df.to_csv('file.csv')

This code, however, only pulls level2 and not ColA from level6.

Expected Output:

Transmission,TransmissionBody,level1,level2,level3,level4,level5,level6,ColA,ColB
,,,,,,,,ABC,123
,,,,,,,,DEF,456

Updated code:

allFiles = glob.glob(folder)
for file in allFiles:
    xmllist = [file]
    for xmlfile in xmllist:
        tree = ET.parse(xmlfile)
        root = tree.getroot()

        def f(elem, result):
            result[elem.tag] = elem.text
            for c in elem:  # Element.getchildren() was removed in Python 3.9
                result = f(c, result)
            return result

        d = f(root, {})
        df = pd.DataFrame(d, index=['values'])

Upvotes: 4

Views: 3485

Answers (2)

perl

Reputation: 9941

If I understood your question correctly, you need to traverse the XML tree, so you probably want to have a recursive function that does that. Something like the following:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

def f(elem, result):
    # Record this element's text, then recurse into its children
    result[elem.tag] = elem.text
    for c in elem:  # Element.getchildren() was removed in Python 3.9
        result = f(c, result)
    return result

d = f(root, {})
df = pd.DataFrame(d, index=['values']).T
df

Out:

    values
Transmission    \n
TransmissionBody    \n
level1  \n
level2  \n
level3  \n
level4  \n
level5  \n
level6  \n
ColA    ABC
ColB    123
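If only the leaf values are wanted (ColA and ColB here), a variation is to skip every element that has children. The following is a minimal sketch, not part of the code above: the helper name `leaves` and the shortened inline XML are assumptions for illustration, and `Element.iter()` is used in place of explicit recursion. Note that repeated tag names would overwrite each other in the dictionary.

```python
import xml.etree.ElementTree as ET

# Shortened inline sample mirroring the question's nesting (hypothetical data)
XML = """<Transmission>
    <TransmissionBody>
        <level1>
            <level2>
                <ColA>ABC</ColA>
                <ColB>123</ColB>
            </level2>
        </level1>
    </TransmissionBody>
</Transmission>"""

def leaves(root):
    """Return {tag: text} for elements that have no child elements."""
    # iter() yields root and all descendants in document order;
    # len(elem) is the number of direct children, so 0 means a leaf.
    return {elem.tag: elem.text for elem in root.iter() if len(elem) == 0}

row = leaves(ET.fromstring(XML))
print(row)
```

The resulting dictionary can be passed to `pd.DataFrame([row])` in the same way as `final` in the question.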

Update: Here is how to handle multiple XML files. I've added another file similar to the original one, with the ColA and ColB rows replaced with:

<ColA>DEF</ColA>
<ColB>456</ColB>

Here's the code:

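import glob
import pandas as pd
import xml.etree.ElementTree as ET

def f(elem, result):
    result[elem.tag] = elem.text
    for c in elem:  # Element.getchildren() was removed in Python 3.9
        result = f(c, result)
    return result

# Build one dict per file; reusing a single dict would overwrite earlier values
results = []
for file in sorted(glob.glob('*.xml')):
    tree = ET.parse(file)
    root = tree.getroot()
    results.append(f(root, {}))

# Each file becomes one column (0, 1, ...) after transposing
df = pd.DataFrame(results).T
df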

And the output:

                    0    1
Transmission       \n   \n
TransmissionBody   \n   \n
level1             \n   \n
level2             \n   \n
level3             \n   \n
level4             \n   \n
level5             \n   \n
level6             \n   \n
ColA              ABC  DEF
ColB              123  456
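The question's expected output has one CSV row per file, with empty cells for the container elements. A rough sketch of that shape follows. It is an illustration under stated assumptions: the two inline strings stand in for the XML files, the helper name `to_row` is hypothetical, whitespace-only text is stripped to an empty string, and the standard-library `csv` module is used instead of pandas to keep the example dependency-free.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two inline documents standing in for the XML files (hypothetical data)
DOCS = [
    "<Transmission><TransmissionBody><level1>"
    "<ColA>ABC</ColA><ColB>123</ColB>"
    "</level1></TransmissionBody></Transmission>",
    "<Transmission><TransmissionBody><level1>"
    "<ColA>DEF</ColA><ColB>456</ColB>"
    "</level1></TransmissionBody></Transmission>",
]

def to_row(xml_text):
    """Map every tag to its text, with whitespace-only text as ''."""
    root = ET.fromstring(xml_text)
    return {elem.tag: (elem.text or "").strip() for elem in root.iter()}

rows = [to_row(doc) for doc in DOCS]

# Write a header row followed by one row per document
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

With pandas, the same list of dictionaries can be written with `pd.DataFrame(rows).to_csv('file.csv', index=False)`.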

Upvotes: 1

fastali

Reputation: 29

You have an indentation issue on

if len(elem):

I guess fixing that should resolve it.

Upvotes: 2
