JLO
JLO

Reputation: 3

Parsing multiple XML attribute values and appending to a dataframe/list

I'm currently trying to parse some XML (XML added at end of question) in Python to get the value of multiple attributes. The end goal is being able to flag, first if there is an attribute and then the value of that attribute. I then combine that with the column header to identify that column has an attribute (as not all will).

Imports are as follows:

#---import libraries
import xlwings as xw #---lib to do most of the excel steps
import pandas as pd #---lib to read csv file
import openpyxl as xl #---lib to convert csv to xls
import os #---lib to extract filename for input and folder movement
from datetime import datetime
import xml.etree.ElementTree as Etree

Using the below code I can see which fields have an attribute and then add the text 'custom' to the head.

for elem in tree.iter():
    if bool(elem.attrib):
        ea = str(elem.attrib)
        B.update({elem.tag + " (Custom)": ea})
        A.append(B) # appending B to list

    else:
        B.update({elem.tag: elem.attrib})
        A.append(B) # appending B to list

The result when printing is as follows.

  company department employee name job      salary (Custom)
0      {}         {}       {}   {}  {}  {'datatype': 'int'}

Now this works when there is only one, however I need to be able to achieve a similar result when there is multiple attributes, the example I am working on has two ("custom" and "datatype"). So I need to get both and then add them to what is ultimately the columns in my dataframe / list.

An example might look like:

Job [string] Salary (custom)[int]
Cell 1 Cell 2

The data below the headers isn't important to me, as from here I will pull only the headers into a new master excel document to create what will ultimately become a sort of report definition / data dictionary where I can parse the XML files of all the reports used across an organisation.

I am just unable to target the attributes I need. The attributes in question are 'Custom' where the value might be 'Yes' and 'datatype' where the values could be 'int', 'string' etc. Ideally I want the value of the custom attribute to be added to the column header with normal brackets () and the datatype value in square brackets []

XML file:

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <department>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
    </department>
    <department>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
    </department>
</company>

Upvotes: 0

Views: 156

Answers (1)

Parfait
Parfait

Reputation: 107652

Consider XSLT, the special-purpose language designed to transform XML files, which is supported in pandas.read_xml() using the default lxml parser and stylesheet argument.

Below XSLT will flatten content to employee level, drawing down ancestor elements, and dynamically handling salary element depending on attributes. Sample data includes a complete XML capturing top earners in pandas and xml StackOverflow tags and using seafaring ranks.

Do note: the brackets and parentheses are not included being disallowed symbols in XML node names. Change as needed after parsing in Python.

XML

<?xml version="1.0" encoding="UTF-8"?>
<company name="stackoverflow">
    <department name="python">
        <employee>
            <name>Jezrael</name>
            <job>Captain</job>
            <salary custom="Yes">30463</salary>
            <salary datatype="int">806160</salary>
        </employee>
        <employee>
            <name>Ed Chum</name>
            <job>Chief Mate</job>
            <salary custom="Yes">3058</salary>
            <salary datatype="int">369802</salary>
        </employee>
        <employee>
            <name>Andy Hayden</name>
            <job>Second Mate</job>
            <salary custom="Yes">1959</salary>
            <salary datatype="int">353578</salary>
        </employee>
        <employee>
            <name>cs95</name>
            <job>Third Mate</job>
            <salary custom="Yes">5635</salary>
            <salary datatype="int">369908</salary>
        </employee>
    </department>
    <department name="xml">
        <employee>
            <name>Dimitre Novatchev</name>
            <job>Captain</job>
            <salary custom="Yes">5142</salary>
            <salary datatype="int">239598</salary>
        </employee>
        <employee>
            <name>Jon Skeet</name>
            <job>Chief Mate</job>
            <salary custom="Yes">35587</salary>
            <salary datatype="int">1398836</salary>
        </employee>
        <employee>
            <name>kjhughes</name>
            <job>Second Mate</job>
            <salary custom="Yes">2954</salary>
            <salary datatype="int">104539</salary>
        </employee>
        <employee>
            <name>Michael Kay</name>
            <job>Third Mate</job>
            <salary custom="Yes">8226</salary>
            <salary datatype="int">154024</salary>
        </employee>
    </department>
</company>

XSLT (save as .xsl script, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- PARSE ONLY employee LEVEL ELEMENTS -->
    <xsl:template match="/company">
     <employees>
       <xsl:apply-templates select="descendant::employee"/>
     </employees>
    </xsl:template>

    <!-- RESTYLE employee ELEMENTS, DRAW DOWN ANCESTORS -->
    <xsl:template match="employee">
     <xsl:copy>
       <company><xsl:value-of select="ancestor::company/@name"/></company>
       <department><xsl:value-of select="ancestor::department/@name"/></department>
       <xsl:copy-of select="name|job"/>
       <xsl:apply-templates select="salary"/>
     </xsl:copy>
    </xsl:template>
    
    <!-- DYNAMICALLY HANDLE salary ELEMENTS -->
    <xsl:template match="salary">
      <xsl:element name="{concat(name(), '_', name(@*), '_', @*)}">
        <xsl:value-of select="text()"/>
      </xsl:element>
    </xsl:template>
</xsl:stylesheet>

Online Demo

Python (requires third-party lxml package for stylesheet)

import pandas as pd

employees_df = pd.read_xml("myInput.xml", stylesheet="myScript.xsl")

print(employees_df)
         company department               name          job  salary_custom_Yes  salary_datatype_int
0  stackoverflow     python            Jezrael      Captain              30463               806160
1  stackoverflow     python            Ed Chum   Chief Mate               3058               369802
2  stackoverflow     python        Andy Hayden  Second Mate               1959               353578
3  stackoverflow     python               cs95   Third Mate               5635               369908
4  stackoverflow        xml  Dimitre Novatchev      Captain               5142               239598
5  stackoverflow        xml          Jon Skeet   Chief Mate              35587              1398836
6  stackoverflow        xml           kjhughes  Second Mate               2954               104539
7  stackoverflow        xml        Michael Kay   Third Mate               8226               154024

Upvotes: 1

Related Questions