Aleksander Kuś
Aleksander Kuś

Reputation: 47

XML import to DataFrame Python

I have XML file which looks like this: enter image description here

I'm trying to conver it to DF in Python using this:

import pandas as pd
import xml.etree.ElementTree as et

def parse_XML(xml_file, df_cols): 
"""Parse the input XML file and store the result in a pandas 
DataFrame with the given columns. 

The first element of df_cols is supposed to be the identifier 
variable, which is an attribute of each node element in the 
XML data; other features will be parsed from the text content 
of each sub-element. 
"""

xtree = et.parse(xml_file)
xroot = xtree.getroot()
rows = []

for node in xroot: 
    res = []
    res.append(node.attrib.get(df_cols[0]))
    for el in df_cols[1:]: 
        if node is not None and node.find(el) is not None:
            res.append(node.find(el).text)
        else: 
            res.append(None)
    rows.append({df_cols[i]: res[i] 
                 for i, _ in enumerate(df_cols)})

out_df = pd.DataFrame(rows, columns=df_cols)
    
return out_df

I'm using function like this:

parse_XML(R'C:\Users\aleks\Desktop\zadania python+sql\import_xml.xml', ['Year', 'Row', 'New', 'Used', 'Total Sales New', 'Total Sales Used'])

But the result instead of full imported XML file is this: enter image description here

How do I make it work so I have full XML file imported to DataFrame? Thanks.

Upvotes: 0

Views: 387

Answers (1)

Parfait
Parfait

Reputation: 107767

Since your XML is relatively flat, consider merging dictionaries of element and attribute data to be passed into pandas.DataFrame constructor. Specifically, dictionaries are derived via list/dict comprehension of tag and text parsing of row elements and attrib dictionary.

xtree = et.parse(xml_file)

# LIST OF DICTIONARIES
data = [{ **{r.tag:r.text for r in row.findall('*')}, **row.attrib
        } for row in xtree.findall('.//row/row')]

out_df = pd.DataFrame(data)

To demonstrate from OpenData Michigan MVA Vehicle Sales Counts (which resembles OP's image data):

import pandas as pd
import xml.etree.ElementTree as et
from urllib import request

url = "https://opendata.maryland.gov/api/views/un65-7ipd/rows.xml?accessType=DOWNLOAD"
rq = request.urlopen(url)
xtree = et.fromstring(rq.read())

# LIST OF DICTIONARIES
data = [{ **{r.tag:r.text for r in row.findall('*')}, **row.attrib
        } for row in xtree.findall('.//row/row')]

out_df = pd.DataFrame(data)

Output

out_df

#     year month    new   used total_sales_new total_sales_used                 _id                                 _uuid _position                                           _address
#0    2002   JAN  31106  49927       755015820        386481929  row-a49t-ihm3.rpap  00000000-0000-0000-05D6-978893D301BF         0  https://opendata.maryland.gov/resource/un65-7i...
#1    2002   FEB  27520  50982       664454223        361353242  row-njtn_crc8_qawh  00000000-0000-0000-3211-E4A718B8A756         0  https://opendata.maryland.gov/resource/un65-7i...
#2    2002   MAR  34225  58794       805666244        419385387  row-tn8d-5xax-gcqz  00000000-0000-0000-9FC0-AB50DEE551EE         0  https://opendata.maryland.gov/resource/un65-7i...
#3    2002   APR  36452  59817       846368297        433061150  row-bjnm-jnr7.7rbw  00000000-0000-0000-FE23-E2E416AC13DB         0  https://opendata.maryland.gov/resource/un65-7i...
#4    2002   MAY  37359  60577       855005784        442569410  row-5wkh~fkp3.mzuj  00000000-0000-0000-CBC6-6E4FF3707E07         0  https://opendata.maryland.gov/resource/un65-7i...
#..    ...   ...    ...    ...             ...              ...                 ...                                   ...       ...                                                ...
#221  2020   JUN  23704  63395       821297823        699671992  row-m9ah.mpqr.7eww  00000000-0000-0000-3599-5C2EEABDA225         0  https://opendata.maryland.gov/resource/un65-7i...
#222  2020   JUL  22748  65652       784818757        608761285  row-79ht-rw2j~j4vy  00000000-0000-0000-4E88-6D5061A28AC5         0  https://opendata.maryland.gov/resource/un65-7i...
#223  2020   AUG  12640  48594       456451505        329492612  row-yh7w-48wu_vpnx  00000000-0000-0000-6ED5-CA4A564695EA         0  https://opendata.maryland.gov/resource/un65-7i...
#224  2020   SEP  11336  42148       395164402        274497385  row-e2ur-7zck~4rdd  00000000-0000-0000-C640-6EE6DAA0F07C         0  https://opendata.maryland.gov/resource/un65-7i...
#225  2020   OCT  13271  43827       477111877        328752668  row-yn9u~un9g_3heh  00000000-0000-0000-4EEE-250E14B81104         0  https://opendata.maryland.gov/resource/un65-7i...

Upvotes: 1

Related Questions