Mologan
Mologan

Reputation: 37

Read xml file to dataframe in python

I got a xml file need to read in python as dataframe, it is a part of the xml code:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <data id="root_661191">
        <index id="data_162062">
            <item id="index_829361_1">173915</item>
            <item id="index_829361_2">14712</item>
            <item id="index_829361_3">321255</item>
        </index>
        <property_id id="data_809625">
            <item id="property_id_844926_1">88942.0</item>
            <item id="property_id_844926_2">88162.0</item>
            <item id="property_id_844926_3">80553.0</item>
        </property_id>
        <addr_street id="data_409265">
            <item id="addr_street_959977_1">58 Middleton Street</item>
            <item id="addr_street_959977_2">24 Royena Road</item>
            <item id="addr_street_959977_3">9 Cafardi Boulevard</item>
        </addr_street>
        <price id="data_784942">
           <item id="price_225606_1">7480000.0</item>
           <item id="price_225606_2">7728000.0</item>
           <item id="price_225606_3">7659000.0</item> 
        </price>
    </data>
</root>

I try some easier sample data to test my read function, they work. But when I use my function to do this xml file it only produce None in output. I think it might be col names, but I don't know how to fix it, could anyone help me? The function I used is:

import pandas as pd
import xml.etree.ElementTree as et

def parse_xml(xml_file, df_cols): 
    
    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []
    
    for node in xroot: 
        res = []
        res.append(node.attrib.get(df_cols[0]))
        for el in df_cols[1:]: 
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else: 
                res.append(None)
        rows.append({df_cols[i]: res[i] 
                     for i, _ in enumerate(df_cols)})
    
    out_df = pd.DataFrame(rows, columns=df_cols)
        
    return out_df

df_cols = ['index','property_id','addr_street','price']
parse_xml['myxmlfile.xml',df_cols]

Upvotes: 0

Views: 2158

Answers (1)

Jonathan Leon
Jonathan Leon

Reputation: 5648

I think this is what you want. You should be able to put this in a function if you need

tree = et.parse('myxmlfile.xml')
root = tree.getroot()

df_cols = ['index','property_id','addr_street','price']
mlist = []
for col in df_cols:
    for d in root.findall('data'):
        # print(d.attrib)
        for c in d.findall(col):
            # print(c)
            # print(c.attrib)
            # print(c.attrib.get('id'))
            lst = []
            for itm in c.findall('item'):
                # print(itm.text)
                lst.append(itm.text)
        # print({col:lst})
        mlist.append(pd.DataFrame({col:lst}))
mlist
pd.concat(mlist, axis=1)

Output:

    index property_id          addr_street      price
0  173915     88942.0  58 Middleton Street  7480000.0
1   14712     88162.0       24 Royena Road  7728000.0
2  321255     80553.0  9 Cafardi Boulevard  7659000.0

Upvotes: 1

Related Questions