dule91sd
dule91sd

Reputation: 31

How to read nested xml file with python pandas?

I am using python and pandas library to read xml file. And it's great but doesn't read complete file.

python code:

df = pd.read_xml('filename.xml')
df.to_excel('filename.xlsx', index=False)

xml file:

<?xml version="1.0" encoding="UTF-8"?>
<artikli>
    <artikal>
        <id>197</id>
        <sifra>15138</sifra>
        <barKod>8605026307328</barKod>
        <naziv><![CDATA[Torbica pojas sjajna za Nokia N95]]></naziv>
        <kategorija1><![CDATA[MOBILNI]]></kategorija1>
        <kategorija2><![CDATA[!TORBICE NA RASPRODAJI PO MODELU]]></kategorija2>
        <kategorija3><![CDATA[TORBICE ZA NOKIA]]></kategorija3>
        <vpCena>1.0800</vpCena>
        <mpCena>998.9760</mpCena>
        <dostupan>1</dostupan>
        <opis><![CDATA[]]></opis>
        <slike>
            <slika><![CDATA[http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890.jpg]]></slika>
            <slika><![CDATA[http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10064.jpg]]></slika>
            <slika><![CDATA[http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10065.jpg]]></slika>
            <slika><![CDATA[http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10066.jpg]]></slika>
        </slike>
    </artikal>
    <artikal>
        <id>198</id>
        <sifra>15163</sifra>
        <barKod>8605026396155</barKod>
        <naziv><![CDATA[Torbica pojas sjajna za Nokia 6303]]></naziv>
        <kategorija1><![CDATA[MOBILNI]]></kategorija1>
        <kategorija2><![CDATA[!TORBICE NA RASPRODAJI PO MODELU]]></kategorija2>
        <kategorija3><![CDATA[TORBICE ZA NOKIA]]></kategorija3>
        <vpCena>1.0800</vpCena>
        <mpCena>998.9760</mpCena>
        <dostupan>1</dostupan>
        <opis><![CDATA[]]></opis>
        <slike>
            <slika><![CDATA[http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132477.png]]></slika>
            <slika><![CDATA[http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132478.png]]></slika>
            <slika><![CDATA[http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132479.png]]></slika>
        </slike>
    </artikal>
</artikli>

Output excel file:

id  sifra   barKod  naziv   kategorija1 kategorija2 kategorija3 vpCena  mpCena  dostupan    opis    slike
197 15138   8605026307328   Torbica pojas sjajna za Nokia N95   MOBILNI !TORBICE NA RASPRODAJI PO MODELU    TORBICE ZA NOKIA    1,08    998,976 1       
198 15163   8605026396155   Torbica pojas sjajna za Nokia 6303  MOBILNI !TORBICE NA RASPRODAJI PO MODELU    TORBICE ZA NOKIA    1,08    998,976 1       

So I am missing data in column "slike" from file. How to get that data? Desired table should looks like this:

artikal|id  artikal|sifra   artikal|barKod  artikal|naziv   artikal|kategorija1 artikal|kategorija2 artikal|kategorija3 artikal|vpCena  artikal|mpCena  artikal|dostupan    artikal|opis    artikal|slike|slika
197 15138   8605026307328   Torbica pojas sjajna za Nokia N95   MOBILNI !TORBICE NA RASPRODAJI PO MODELU    TORBICE ZA NOKIA    1.0800  998.9760    1       [http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890.jpg, http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10064.jpg, http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10065.jpg, http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10066.jpg]
198 15163   8605026396155   Torbica pojas sjajna za Nokia 6303  MOBILNI !TORBICE NA RASPRODAJI PO MODELU    TORBICE ZA NOKIA    1.0800  998.9760    1       [http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132477.png, http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132478.png, http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132479.png]

Upvotes: 3

Views: 9454

Answers (2)

Amir Charkhi
Amir Charkhi

Reputation: 836

You start by reading the xml file and also making a placeholder file for you to write the output in a csv format (or any other text format - you might have to tweak the code a bit).

Then you specify the names of columns in your final dataframe (after you have parsed the xml file). But this information is already in your xml file anyways, so you just to make sure you understand the contents.

Lastly, loop over the entries and find the keywords (column names) to read and write to the csv.

Once done, you can read the csv using pd.read_csv('output.csv').

import xml.etree.ElementTree as ET
import csv

# Load and parse the XML file
tree = ET.parse('your_xml_file.xml')
root = tree.getroot()

# Define the CSV file and writer
csv_file = open('output.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)

# Write header row
header = ['column1', 'column2', 'column3', 'column4', 'column5']
csv_writer.writerow(header)

# Extract data and write to CSV
for id in root.findall('.//main_identifier'):
    column1_text = id.find('column1').text if id.find('column') is not None else ''
    column2_text = id.find('.//column2').text if id.find('.//column2') is not None else ''
    column3_text = id.find('.//column3').text if id.find('.//column3') is not None else ''
    column4 = id.find('.//column4').text if id.find('.//column4') is not None else ''
    column5_text = id.find('.//column5').text if id.find('.//column5') is not None else ''
    
    # Write data to CSV
    csv_writer.writerow([column1_text, column2_text, column3_text, column4_text, column5_text])

# Close the CSV file
csv_file.close()

Upvotes: 0

Parfait
Parfait

Reputation: 107567

While XML as a data format can take many forms from flat to deeply nested, data frames must adhere to a single structure of two dimensions: row by column. Hence, as noted in docs, pandas.read_xml, is a convenience method best for flatter, shallow XML files. You can use xpath to traverse different areas of the document, not just the default /*.

However, you can use XSLT 1.0 (special purpose language designed to transform XML files) with the default parser, lxml, to transform any XML to the needed flat format of data frame. Below stylesheet will restyle the <slike> node for comma-separated text of its children <slika>:

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
    </xsl:template>
    
    <xsl:template match="slike">
     <xsl:copy>
       <xsl:for-each select="*">
         <xsl:value-of select="text()"/>
         <xsl:if test="position() != last()">
            <xsl:text>,</xsl:text>
         </xsl:if>
       </xsl:for-each>
     </xsl:copy>
    </xsl:template>  
</xsl:stylesheet>

Online Demo

Python

artikal_df = pd.read_xml("my_filename.xml", stylesheet="my_style.xsl") 

# CONVERT COMMA-SEPARATED VALUES TO EMBEDDED LISTS
artikal_df["slike"] = artikal_df["slike"].str.split(',')

# PREFIX PARENT NODE NAME
artikal_df = artikal_df.add_prefix('artikal_')

artikal_df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 2 entries, 0 to 1
# Data columns (total 12 columns):
#  #   Column               Non-Null Count  Dtype  
# ---  ------               --------------  -----  
#  0   artikal_id           2 non-null      int64  
#  1   artikal_sifra        2 non-null      int64  
#  2   artikal_barKod       2 non-null      int64  
#  3   artikal_naziv        2 non-null      object 
#  4   artikal_kategorija1  2 non-null      object 
#  5   artikal_kategorija2  2 non-null      object 
#  6   artikal_kategorija3  2 non-null      object 
#  7   artikal_vpCena       2 non-null      float64
#  8   artikal_mpCena       2 non-null      float64
#  9   artikal_dostupan     2 non-null      int64  
#  10  artikal_opis         0 non-null      float64
#  11  artikal_slike        2 non-null      object 
# dtypes: float64(3), int64(4), object(5)
# memory usage: 320.0+ bytes

Upvotes: 5

Related Questions