Reputation: 37
I got a xml file need to read in python as dataframe, it is a part of the xml code:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<data id="root_661191">
<index id="data_162062">
<item id="index_829361_1">173915</item>
<item id="index_829361_2">14712</item>
<item id="index_829361_3">321255</item>
</index>
<property_id id="data_809625">
<item id="property_id_844926_1">88942.0</item>
<item id="property_id_844926_2">88162.0</item>
<item id="property_id_844926_3">80553.0</item>
</property_id>
<addr_street id="data_409265">
<item id="addr_street_959977_1">58 Middleton Street</item>
<item id="addr_street_959977_2">24 Royena Road</item>
<item id="addr_street_959977_3">9 Cafardi Boulevard</item>
</addr_street>
<price id="data_784942">
<item id="price_225606_1">7480000.0</item>
<item id="price_225606_2">7728000.0</item>
<item id="price_225606_3">7659000.0</item>
</price>
</data>
</root>
I try some easier sample data to test my read function, they work. But when I use my function to do this xml file it only produce None in output. I think it might be col names, but I don't know how to fix it, could anyone help me? The function I used is:
import pandas as pd
import xml.etree.ElementTree as et
def parse_xml(xml_file, df_cols):
xtree = et.parse(xml_file)
xroot = xtree.getroot()
rows = []
for node in xroot:
res = []
res.append(node.attrib.get(df_cols[0]))
for el in df_cols[1:]:
if node is not None and node.find(el) is not None:
res.append(node.find(el).text)
else:
res.append(None)
rows.append({df_cols[i]: res[i]
for i, _ in enumerate(df_cols)})
out_df = pd.DataFrame(rows, columns=df_cols)
return out_df
df_cols = ['index','property_id','addr_street','price']
parse_xml['myxmlfile.xml',df_cols]
Upvotes: 0
Views: 2158
Reputation: 5648
I think this is what you want. You should be able to put this in a function if you need
tree = et.parse('myxmlfile.xml')
root = tree.getroot()
df_cols = ['index','property_id','addr_street','price']
mlist = []
for col in df_cols:
for d in root.findall('data'):
# print(d.attrib)
for c in d.findall(col):
# print(c)
# print(c.attrib)
# print(c.attrib.get('id'))
lst = []
for itm in c.findall('item'):
# print(itm.text)
lst.append(itm.text)
# print({col:lst})
mlist.append(pd.DataFrame({col:lst}))
mlist
pd.concat(mlist, axis=1)
Output:
index property_id addr_street price
0 173915 88942.0 58 Middleton Street 7480000.0
1 14712 88162.0 24 Royena Road 7728000.0
2 321255 80553.0 9 Cafardi Boulevard 7659000.0
Upvotes: 1