Romeo Botanpi
Romeo Botanpi

Reputation: 51

How to read XML file into Pandas Dataframe

I have a xml file: 'product.xml' that I want to read using pandas, here is an example of the sample file:

<?xml version="1.0"?>
 <Rowset>
  <ROW>
   <Product_ID>32</Product_ID>
   <Company_ID>2</Company_ID>
   <User_ID>90</User_ID>
   <Product_Type>1</Product_Type>
   <Application_ID>BBC#:1010</Application_ID>
  </ROW>
  <ROW>
   <Product_ID>22</Product_ID>
   <Company_ID>4</Company_ID>
   <User_ID>190</User_ID>
   <Product_Type>2</Product_Type>
   <Application_ID>NBA#:1111</Application_ID>
  </ROW>
  <ROW>
   <Product_ID>63</Product_ID>
   <Company_ID>4</Company_ID>
   <User_ID>99</User_ID>
   <Product_Type>1</Product_Type>
   <Application_ID>BBC#:1212</Application_ID>
  </ROW>
  <ROW>
   <Product_ID>22</Product_ID>
   <Company_ID>2</Company_ID>
   <User_ID>65</User_ID>
   <Product_Type>2</Product_Type>
   <Application_ID>NBA#:2210</Application_ID>
  </ROW>

I am trying to generate a 2D Table using pandas like:

Application ID Product Type  Product ID
BBC#:1010            1         32
NBA#:1111            2         22
BBC#:1212            1         63  
NBA#:2210            2         22   

so far, I have tried this code:

import xml.etree.cElementTree as ET
import pandas as pd
import pandas_read_xml as pdx
df = pdx.read_xml('product.xml')
path='product.xml'
dfcols = ['Application_ID', 'Product_Type', 'Product_ID']
root = et.parse(path)
rows = root.findall('.//ROW')
#NESTED LIST
xml_data = [[row.get('Application_ID'), row.get('Product_Type'), row.get('Product_ID')] for row in rows]
df_xml = pd.DataFrame(xml_data, columns=dfcols)

print(df_xml)

How can I print that type of 2D Table?, please help, Thank you.

Upvotes: 3

Views: 13743

Answers (2)

p_sutherland
p_sutherland

Reputation: 491

As of Pandas 1.3.0 there is a read_xml() function that makes working with reading/writing XML data in/out of pandas much easier.

Once you upgrade to Pandas >1.3.0 you can simply use:

df = pd.read_xml("___XML_FILEPATH___")
print(df)

(Note that in the XML sample above the <Rowset> tag needs to be closed)

Upvotes: 2

Corralien
Corralien

Reputation: 120559

Use [] to filter and reorganize columns:

cols = ['Application_ID', 'Product_Type', 'Product_ID']
df = pd.read_xml('product.xml')[cols]
print(df)

# Output:
  Application_ID  Product_Type  Product_ID
0      BBC#:1010             1          32
1      NBA#:1111             2          22
2      BBC#:1212             1          63
3      NBA#:2210             2          22

If you want to replace '_' from your column names by ' ':

df.columns = df.columns.str.replace('_', ' ')
print(df)

# Output:
  Application ID  Product Type  Product ID
0      BBC#:1010             1          32
1      NBA#:1111             2          22
2      BBC#:1212             1          63
3      NBA#:2210             2          22

Upvotes: 2

Related Questions