MKJ
MKJ

Reputation: 509

How to create pandas DataFrame from nested xml

I am trying to create Pandas dataframe out of XML. The XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
<Products>
    <Info>
        <Msg>Shop items.</Msg>
    </Info>
    <shop shopNr="01">
        <ItemNr>1001</ItemNr>
        <ItemNr>1002</ItemNr>
        <ItemNr>1003</ItemNr>
        <ItemNr>1004</ItemNr>
                <ItemNr>1010</ItemNr>
    </shop>
    <shop shopNr="02">
        <ItemNr>1002</ItemNr>
        <ItemNr>1006</ItemNr>
        <ItemNr>1005</ItemNr>
    </shop>
    <shop shopNr="03">
        <ItemNr>1009</ItemNr>
        <ItemNr>1006</ItemNr>
        <ItemNr>1005</ItemNr>
        <ItemNr>1002</ItemNr>
    </shop>
</Products>

I have tried using XML Etree as the code below. I have two problems.

First, I can not get the values of ItemNr as children of the root. Instead of getting the value, ie. 1001, I get

<Element 'ItemNr' at 0x000001E2D6C41B38>.

The second problem is when I am creating a dataframe out of the lists. I end up having a list of lists in items. Although the result is empty now since I could not get the values above, I want to end up with a flattened list.

import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()

shops = []
items = []
for node in root.iter('shop'):
    shops.append(node.attrib.get('shopNr'))
    items.append(list(node))

d = {'shops': shops, 'items': items}
df = pd.DataFrame(d)

The DataFrame produced.

 shops                 items
0    01  [[], [], [], [], []]
1    02          [[], [], []]
2    03      [[], [], [], []]

desired output is:


 shops                 items
0    01  [1001, 1002, 1003, 1004, 1010]
1    02          [1002, 1006, 1005]
2    03      [1009, 1006, 1005, 1002]

Upvotes: 1

Views: 5956

Answers (2)

Ora Aff
Ora Aff

Reputation: 672

I hope this is the expected output:

import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
    items = []
    shop_Nr = ashop.attrib.get('shopNr')
    for anitem in ashop.iter('ItemNr'):
        items.append(anitem.text)
    shops_items = [shop_Nr,items]
    all_shops_items.append(shops_items)
df = pd.DataFrame(all_shops_items,columns=['SHOP_NUMBER','ITEM_NUMBER'])        
print(df)

Output:

  SHOP_NUMBER                     ITEM_NUMBER
0          01  [1001, 1002, 1003, 1004, 1010]
1          02              [1002, 1006, 1005]
2          03        [1009, 1006, 1005, 1002]

If you want shops with individual items :

import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
    shop_Nr = ashop.attrib.get('shopNr')
    for anitem in ashop.iter('ItemNr'):
        item_Nr = anitem.text
        shops_items = [shop_Nr,item_Nr]
        all_shops_items.append(shops_items)
df = pd.DataFrame(all_shops_items,columns=['SHOP_NUMBER','ITEM_NUMBER'])        
print(df)

output:

   SHOP_NUMBER ITEM_NUMBER
0           01        1001
1           01        1002
2           01        1003
3           01        1004
4           01        1010
5           02        1002
6           02        1006
7           02        1005
8           03        1009
9           03        1006
10          03        1005
11          03        1002

Upvotes: 2

cullzie
cullzie

Reputation: 2755

You want to append the text values from the ItemNr elements which are under the shop element to the items list and not the xml Element python object which is what you were doing.

The following code was working for me:

items.append([item_nr_element.text for item_nr_element in node.getchildren()])

Upvotes: 2

Related Questions