Reputation: 93
I have the following xml file. I want to extract all lines starting with ItemDescription and create a dataframe out of them such that one column contains the color, another the ID, another the Letter and so on. How can I do this?
I tried it with the package xml.etree.ElementTree
but I could not produce dataframe because I could not access the elements in the desired lines.
I do not want to use pandas_read_xml
because it works only with pip, I guess.
pd.read_xml
also does not work even though I updated pandas.
Is there a reliable way to do it with xml.etree.ElementTree
or with another package that is not too fancy?
<?xml version="1.0" ?>
<OrderList>
<ItemDescriptions>
<ItemDescription Color="rosybrown" ID="0" Letter="a" Type="Letter" Weight="1.67"/>
<ItemDescription Color="lightcoral" ID="1" Letter="a" Type="Letter" Weight="0.91"/>
<ItemDescription Color="indiaread" ID="2" Letter="a" Type="Letter" Weight="0.62"/>
<ItemDescription Color="brown" ID="3" Letter="a" Type="Letter" Weight="2.92"/>
<ItemDescription Color="firedbrick" ID="4" Letter="a" Type="Letter" Weight="2.34"/>
<ItemDescription Color="maroon" ID="5" Letter="a" Type="Letter" Weight="0.53"/>
<ItemDescription Color="darkred" ID="6" Letter="a" Type="Letter" Weight="2.72"/>
</ItemDescriptions>
<ItemBundles/>
<Orders>
<Order TimeStamp="">
<Positions>
<Position Count="1" ItemDescriptionID="9"/>
<Position Count="1" ItemDescriptionID="18"/>
</Positions>
</Order>
<Order TimeStamp="">
<Positions>
<Position Count="2" ItemDescriptionID="9"/>
<Position Count="1" ItemDescriptionID="12"/>
<Position Count="2" ItemDescriptionID="14"/>
<Position Count="1" ItemDescriptionID="18"/>
<Position Count="1" ItemDescriptionID="16"/>
</Positions>
</Order>
</Orders>
</OrderList>
Upvotes: 0
Views: 100
Reputation: 23825
Use ElementTree (no external lib is required)
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" ?>
<OrderList>
<ItemDescriptions>
<ItemDescription Color="rosybrown" ID="0" Letter="a" Type="Letter" Weight="1.67"/>
<ItemDescription Color="lightcoral" ID="1" Letter="a" Type="Letter" Weight="0.91"/>
<ItemDescription Color="indiaread" ID="2" Letter="a" Type="Letter" Weight="0.62"/>
<ItemDescription Color="brown" ID="3" Letter="a" Type="Letter" Weight="2.92"/>
<ItemDescription Color="firedbrick" ID="4" Letter="a" Type="Letter" Weight="2.34"/>
<ItemDescription Color="maroon" ID="5" Letter="a" Type="Letter" Weight="0.53"/>
<ItemDescription Color="darkred" ID="6" Letter="a" Type="Letter" Weight="2.72"/>
</ItemDescriptions>
<ItemBundles/>
<Orders>
<Order TimeStamp="">
<Positions>
<Position Count="1" ItemDescriptionID="9"/>
<Position Count="1" ItemDescriptionID="18"/>
</Positions>
</Order>
<Order TimeStamp="">
<Positions>
<Position Count="2" ItemDescriptionID="9"/>
<Position Count="1" ItemDescriptionID="12"/>
<Position Count="2" ItemDescriptionID="14"/>
<Position Count="1" ItemDescriptionID="18"/>
<Position Count="1" ItemDescriptionID="16"/>
</Positions>
</Order>
</Orders>
</OrderList>'''
root = ET.fromstring(xml)
data = [desc.attrib for desc in root.findall('.//ItemDescription')]
df = pd.DataFrame(data)
print(df)
output
Color ID Letter Type Weight
0 rosybrown 0 a Letter 1.67
1 lightcoral 1 a Letter 0.91
2 indiaread 2 a Letter 0.62
3 brown 3 a Letter 2.92
4 firedbrick 4 a Letter 2.34
5 maroon 5 a Letter 0.53
6 darkred 6 a Letter 2.72
Upvotes: 2
Reputation: 120559
Use read_xml
and xpath
:
>>> pd.read_xml('data.xml', xpath='./ItemDescriptions/ItemDescription')
Color ID Letter Type Weight
0 rosybrown 0 a Letter 1.67
1 lightcoral 1 a Letter 0.91
2 indiaread 2 a Letter 0.62
3 brown 3 a Letter 2.92
4 firedbrick 4 a Letter 2.34
5 maroon 5 a Letter 0.53
6 darkred 6 a Letter 2.72
Alternative with lxml
:
from lxml import etree
tree = etree.parse('data.xml')
df = pd.DataFrame([dict(elmt.items())
for elmt in tree.xpath('.//ItemDescription')])
Upvotes: 2