and_and
and_and

Reputation: 475

How to convert xml to dataFrame with pandas

I am newly in pandas and I just start my code learning. Please, it would be great if you could help me. I have a simple XML like this and I wanna convert it in a dataframe with pandas

   <products_availability date="2020-01-24 06:32" >
    <region id="122">
        <products count="45453242">
            <product id="1000001">0</product>
            <product id="1000002">5</product>
            <product id="1000003">3</product>
        </products>
   </region>
   </products_availability>`

I use some code but anyway it does not help me:

    import pandas as pd
    import xml.etree.ElementTree as et
    xtree = et.parse("file.xml")
    xroot = xtree.getroot()
    df_cols = ["product"]
    rows = []
    for node in xroot:
    s_product = node.attrib.get("product")
    rows.append({"name":  s_product
                 })
    out_df = pd.DataFrame(rows, columns = df_cols)

enter image description here

Upvotes: 1

Views: 217

Answers (2)

balderman
balderman

Reputation: 23815

The code below flatting the xml (region,products,product) into a single record.

import xml.etree.ElementTree as ET

import pandas as pd

xml = '''<products_availability date="2020-01-24 06:32" >
    <region id="122">
        <products count="45453242">
            <product id="1000001">0</product>
            <product id="1000002">5</product>
            <product id="1000003">3</product>
        </products>
    </region>
    <region id="133">
        <products count="45453277">
            <product id="1000004">7</product>
            <product id="1000005">3</product>
            <product id="1000006">1</product>
        </products>
    </region>
</products_availability>'''

data = []
root = ET.fromstring(xml)
regions = root.findall('.//region')
for region in regions:
    region_id = region.attrib['id']
    products_count = region.find('./products').attrib['count']
    for product in region.findall('.//product'):
        entry = {'region_id': region_id, 'products_count': products_count,
                 'product_id': product.attrib['id'], 'number': product.text}
        data.append(entry)
df = pd.DataFrame(data)
print(df)

output

  region_id products_count product_id number
0       122       45453242    1000001      0
1       122       45453242    1000002      5
2       122       45453242    1000003      3
3       133       45453277    1000004      7
4       133       45453277    1000005      3
5       133       45453277    1000006      1

Upvotes: 1

FJSevilla
FJSevilla

Reputation: 4513

If you do not need to classify by region, you can use the findall method or iterfind method to find all matching subelements:

<products_availability date="2020-01-24 06:32" >
    <region id="122">
        <products count="45453242">
            <product id="1000001">0</product>
            <product id="1000002">5</product>
            <product id="1000003">3</product>
        </products>
    </region>
    <region id="133">
        <products count="45453242">
            <product id="1000004">7</product>
            <product id="1000005">3</product>
            <product id="1000006">1</product>
        </products>
    </region>
</products_availability>
import pandas as pd
import xml.etree.ElementTree as et


columns = ["product", "products_availability"]
xtree = et.parse("file.xml")
products = ((p.get("id"), p.text) for p in xtree.iterfind(".//product"))
out_df = pd.DataFrame(products, columns=columns)
>>> out_df


   product products_availability
0  1000001                     0
1  1000002                     5
2  1000003                     3
3  1000004                     7
4  1000005                     3
5  1000006                     1

Edition

If the region is needed, just:

import pandas as pd
import xml.etree.ElementTree as et


columns = ["product", "products_availability", "region"]
xtree = et.parse("file.xml")
prds = ((p.get("id"), p.text, r.get("id")) for r in xtree.iterfind(".//region")
            for p in r.iterfind(".//product")
            )

out_df = pd.DataFrame(prds, columns=columns)
>>> out_df


   product products_availability region
0  1000001                     0    122
1  1000002                     5    122
2  1000003                     3    122
3  1000004                     7    133
4  1000005                     3    133
5  1000006                     1    133

Upvotes: 1

Related Questions