Reputation: 509
I am trying to create Pandas dataframe out of XML. The XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Products>
<Info>
<Msg>Shop items.</Msg>
</Info>
<shop shopNr="01">
<ItemNr>1001</ItemNr>
<ItemNr>1002</ItemNr>
<ItemNr>1003</ItemNr>
<ItemNr>1004</ItemNr>
<ItemNr>1010</ItemNr>
</shop>
<shop shopNr="02">
<ItemNr>1002</ItemNr>
<ItemNr>1006</ItemNr>
<ItemNr>1005</ItemNr>
</shop>
<shop shopNr="03">
<ItemNr>1009</ItemNr>
<ItemNr>1006</ItemNr>
<ItemNr>1005</ItemNr>
<ItemNr>1002</ItemNr>
</shop>
</Products>
I have tried using XML Etree as the code below. I have two problems.
First, I can not get the values of ItemNr as children of the root. Instead of getting the value, ie. 1001, I get
<Element 'ItemNr' at 0x000001E2D6C41B38>.
The second problem is when I am creating a dataframe out of the lists. I end up having a list of lists in items. Although the result is empty now since I could not get the values above, I want to end up with a flattened list.
import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops = []
items = []
for node in root.iter('shop'):
shops.append(node.attrib.get('shopNr'))
items.append(list(node))
d = {'shops': shops, 'items': items}
df = pd.DataFrame(d)
The DataFrame produced.
shops items
0 01 [[], [], [], [], []]
1 02 [[], [], []]
2 03 [[], [], [], []]
desired output is:
shops items
0 01 [1001, 1002, 1003, 1004, 1010]
1 02 [1002, 1006, 1005]
2 03 [1009, 1006, 1005, 1002]
Upvotes: 1
Views: 5956
Reputation: 672
I hope this is the expected output:
import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
items = []
shop_Nr = ashop.attrib.get('shopNr')
for anitem in ashop.iter('ItemNr'):
items.append(anitem.text)
shops_items = [shop_Nr,items]
all_shops_items.append(shops_items)
df = pd.DataFrame(all_shops_items,columns=['SHOP_NUMBER','ITEM_NUMBER'])
print(df)
Output:
SHOP_NUMBER ITEM_NUMBER
0 01 [1001, 1002, 1003, 1004, 1010]
1 02 [1002, 1006, 1005]
2 03 [1009, 1006, 1005, 1002]
If you want shops with individual items :
import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
shop_Nr = ashop.attrib.get('shopNr')
for anitem in ashop.iter('ItemNr'):
item_Nr = anitem.text
shops_items = [shop_Nr,item_Nr]
all_shops_items.append(shops_items)
df = pd.DataFrame(all_shops_items,columns=['SHOP_NUMBER','ITEM_NUMBER'])
print(df)
output:
SHOP_NUMBER ITEM_NUMBER
0 01 1001
1 01 1002
2 01 1003
3 01 1004
4 01 1010
5 02 1002
6 02 1006
7 02 1005
8 03 1009
9 03 1006
10 03 1005
11 03 1002
Upvotes: 2
Reputation: 2755
You want to append the text values from the ItemNr
elements which are under the shop
element to the items list and not the xml Element python object which is what you were doing.
The following code was working for me:
items.append([item_nr_element.text for item_nr_element in node.getchildren()])
Upvotes: 2