Reputation: 47
I have XML file which looks like this:
I'm trying to conver it to DF in Python using this:
import pandas as pd
import xml.etree.ElementTree as et
def parse_XML(xml_file, df_cols):
"""Parse the input XML file and store the result in a pandas
DataFrame with the given columns.
The first element of df_cols is supposed to be the identifier
variable, which is an attribute of each node element in the
XML data; other features will be parsed from the text content
of each sub-element.
"""
xtree = et.parse(xml_file)
xroot = xtree.getroot()
rows = []
for node in xroot:
res = []
res.append(node.attrib.get(df_cols[0]))
for el in df_cols[1:]:
if node is not None and node.find(el) is not None:
res.append(node.find(el).text)
else:
res.append(None)
rows.append({df_cols[i]: res[i]
for i, _ in enumerate(df_cols)})
out_df = pd.DataFrame(rows, columns=df_cols)
return out_df
I'm using function like this:
parse_XML(R'C:\Users\aleks\Desktop\zadania python+sql\import_xml.xml', ['Year', 'Row', 'New', 'Used', 'Total Sales New', 'Total Sales Used'])
But the result instead of full imported XML file is this:
How do I make it work so I have full XML file imported to DataFrame? Thanks.
Upvotes: 0
Views: 387
Reputation: 107767
Since your XML is relatively flat, consider merging dictionaries of element and attribute data to be passed into pandas.DataFrame
constructor. Specifically, dictionaries are derived via list/dict comprehension of tag and text parsing of row
elements and attrib
dictionary.
xtree = et.parse(xml_file)
# LIST OF DICTIONARIES
data = [{ **{r.tag:r.text for r in row.findall('*')}, **row.attrib
} for row in xtree.findall('.//row/row')]
out_df = pd.DataFrame(data)
To demonstrate from OpenData Michigan MVA Vehicle Sales Counts (which resembles OP's image data):
import pandas as pd
import xml.etree.ElementTree as et
from urllib import request
url = "https://opendata.maryland.gov/api/views/un65-7ipd/rows.xml?accessType=DOWNLOAD"
rq = request.urlopen(url)
xtree = et.fromstring(rq.read())
# LIST OF DICTIONARIES
data = [{ **{r.tag:r.text for r in row.findall('*')}, **row.attrib
} for row in xtree.findall('.//row/row')]
out_df = pd.DataFrame(data)
Output
out_df
# year month new used total_sales_new total_sales_used _id _uuid _position _address
#0 2002 JAN 31106 49927 755015820 386481929 row-a49t-ihm3.rpap 00000000-0000-0000-05D6-978893D301BF 0 https://opendata.maryland.gov/resource/un65-7i...
#1 2002 FEB 27520 50982 664454223 361353242 row-njtn_crc8_qawh 00000000-0000-0000-3211-E4A718B8A756 0 https://opendata.maryland.gov/resource/un65-7i...
#2 2002 MAR 34225 58794 805666244 419385387 row-tn8d-5xax-gcqz 00000000-0000-0000-9FC0-AB50DEE551EE 0 https://opendata.maryland.gov/resource/un65-7i...
#3 2002 APR 36452 59817 846368297 433061150 row-bjnm-jnr7.7rbw 00000000-0000-0000-FE23-E2E416AC13DB 0 https://opendata.maryland.gov/resource/un65-7i...
#4 2002 MAY 37359 60577 855005784 442569410 row-5wkh~fkp3.mzuj 00000000-0000-0000-CBC6-6E4FF3707E07 0 https://opendata.maryland.gov/resource/un65-7i...
#.. ... ... ... ... ... ... ... ... ... ...
#221 2020 JUN 23704 63395 821297823 699671992 row-m9ah.mpqr.7eww 00000000-0000-0000-3599-5C2EEABDA225 0 https://opendata.maryland.gov/resource/un65-7i...
#222 2020 JUL 22748 65652 784818757 608761285 row-79ht-rw2j~j4vy 00000000-0000-0000-4E88-6D5061A28AC5 0 https://opendata.maryland.gov/resource/un65-7i...
#223 2020 AUG 12640 48594 456451505 329492612 row-yh7w-48wu_vpnx 00000000-0000-0000-6ED5-CA4A564695EA 0 https://opendata.maryland.gov/resource/un65-7i...
#224 2020 SEP 11336 42148 395164402 274497385 row-e2ur-7zck~4rdd 00000000-0000-0000-C640-6EE6DAA0F07C 0 https://opendata.maryland.gov/resource/un65-7i...
#225 2020 OCT 13271 43827 477111877 328752668 row-yn9u~un9g_3heh 00000000-0000-0000-4EEE-250E14B81104 0 https://opendata.maryland.gov/resource/un65-7i...
Upvotes: 1