Reputation: 475
I am newly in pandas and I just start my code learning. Please, it would be great if you could help me. I have a simple XML like this and I wanna convert it in a dataframe with pandas
<products_availability date="2020-01-24 06:32" >
<region id="122">
<products count="45453242">
<product id="1000001">0</product>
<product id="1000002">5</product>
<product id="1000003">3</product>
</products>
</region>
</products_availability>`
I use some code but anyway it does not help me:
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("file.xml")
xroot = xtree.getroot()
df_cols = ["product"]
rows = []
for node in xroot:
s_product = node.attrib.get("product")
rows.append({"name": s_product
})
out_df = pd.DataFrame(rows, columns = df_cols)
Upvotes: 1
Views: 217
Reputation: 23815
The code below flatting the xml (region,products,product) into a single record.
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<products_availability date="2020-01-24 06:32" >
<region id="122">
<products count="45453242">
<product id="1000001">0</product>
<product id="1000002">5</product>
<product id="1000003">3</product>
</products>
</region>
<region id="133">
<products count="45453277">
<product id="1000004">7</product>
<product id="1000005">3</product>
<product id="1000006">1</product>
</products>
</region>
</products_availability>'''
data = []
root = ET.fromstring(xml)
regions = root.findall('.//region')
for region in regions:
region_id = region.attrib['id']
products_count = region.find('./products').attrib['count']
for product in region.findall('.//product'):
entry = {'region_id': region_id, 'products_count': products_count,
'product_id': product.attrib['id'], 'number': product.text}
data.append(entry)
df = pd.DataFrame(data)
print(df)
output
region_id products_count product_id number
0 122 45453242 1000001 0
1 122 45453242 1000002 5
2 122 45453242 1000003 3
3 133 45453277 1000004 7
4 133 45453277 1000005 3
5 133 45453277 1000006 1
Upvotes: 1
Reputation: 4513
If you do not need to classify by region, you can use the findall
method or iterfind
method to find all matching subelements:
<products_availability date="2020-01-24 06:32" > <region id="122"> <products count="45453242"> <product id="1000001">0</product> <product id="1000002">5</product> <product id="1000003">3</product> </products> </region> <region id="133"> <products count="45453242"> <product id="1000004">7</product> <product id="1000005">3</product> <product id="1000006">1</product> </products> </region> </products_availability>
import pandas as pd
import xml.etree.ElementTree as et
columns = ["product", "products_availability"]
xtree = et.parse("file.xml")
products = ((p.get("id"), p.text) for p in xtree.iterfind(".//product"))
out_df = pd.DataFrame(products, columns=columns)
>>> out_df product products_availability 0 1000001 0 1 1000002 5 2 1000003 3 3 1000004 7 4 1000005 3 5 1000006 1
Edition
If the region is needed, just:
import pandas as pd
import xml.etree.ElementTree as et
columns = ["product", "products_availability", "region"]
xtree = et.parse("file.xml")
prds = ((p.get("id"), p.text, r.get("id")) for r in xtree.iterfind(".//region")
for p in r.iterfind(".//product")
)
out_df = pd.DataFrame(prds, columns=columns)
>>> out_df product products_availability region 0 1000001 0 122 1 1000002 5 122 2 1000003 3 122 3 1000004 7 133 4 1000005 3 133 5 1000006 1 133
Upvotes: 1