Parsing multiple XML attribute values and appending to a dataframe/list

Question

I'm currently trying to parse some XML (XML added at end of question) in Python to get the value of multiple attributes. The end goal is being able to flag, first if there is an attribute and then the value of that attribute. I then combine that with the column header to identify that column has an attribute (as not all will).

Imports are as follows:

#---import libraries
import xlwings as xw #---lib to do most of the excel steps
import pandas as pd #---lib to read csv file
import openpyxl as xl #---lib to convert csv to xls
import os #---lib to extract filename for input and folder movement
from datetime import datetime
import xml.etree.ElementTree as Etree

Using the below code I can see which fields have an attribute and then add the text 'custom' to the head.

for elem in tree.iter():
    if bool(elem.attrib):
        ea = str(elem.attrib)
        B.update({elem.tag + " (Custom)": ea})
        A.append(B) # appending B to list

    else:
        B.update({elem.tag: elem.attrib})
        A.append(B) # appending B to list

The result when printing is as follows.

  company department employee name job      salary (Custom)
0      {}         {}       {}   {}  {}  {'datatype': 'int'}

Now this works when there is only one, however I need to be able to achieve a similar result when there is multiple attributes, the example I am working on has two ("custom" and "datatype"). So I need to get both and then add them to what is ultimately the columns in my dataframe / list.

An example might look like:

Job [string]	Salary (custom)[int]
Cell 1	Cell 2

The data below the headers isn't important to me, as from here I will pull only the headers into a new master excel document to create what will ultimately become a sort of report definition / data dictionary where I can parse the XML files of all the reports used across an organisation.

I am just unable to target the attributes I need. The attributes in question are 'Custom' where the value might be 'Yes' and 'datatype' where the values could be 'int', 'string' etc. Ideally I want the value of the custom attribute to be added to the column header with normal brackets () and the datatype value in square brackets []

XML file:

Parfait · Accepted Answer

Consider XSLT, the special-purpose language designed to transform XML files, which is supported in pandas.read_xml() using the default lxml parser and stylesheet argument.

Below XSLT will flatten content to employee level, drawing down ancestor elements, and dynamically handling salary element depending on attributes. Sample data includes a complete XML capturing top earners in pandas and xml StackOverflow tags and using seafaring ranks.

Do note: the brackets and parentheses are not included being disallowed symbols in XML node names. Change as needed after parsing in Python.

XML



    
        
            Jezrael
            Captain
            30463
            806160
        
        
            Ed Chum
            Chief Mate
            3058
            369802
        
        
            Andy Hayden
            Second Mate
            1959
            353578
        
        
            cs95
            Third Mate
            5635
            369908
        
    
    
        
            Dimitre Novatchev
            Captain
            5142
            239598
        
        
            Jon Skeet
            Chief Mate
            35587
            1398836
        
        
            kjhughes
            Second Mate
            2954
            104539
        
        
            Michael Kay
            Third Mate
            8226
            154024

XSLT (save as .xsl script, a special .xml file)

Online Demo

Python (requires third-party lxml package for stylesheet)

import pandas as pd

employees_df = pd.read_xml("myInput.xml", stylesheet="myScript.xsl")

print(employees_df)
         company department               name          job  salary_custom_Yes  salary_datatype_int
0  stackoverflow     python            Jezrael      Captain              30463               806160
1  stackoverflow     python            Ed Chum   Chief Mate               3058               369802
2  stackoverflow     python        Andy Hayden  Second Mate               1959               353578
3  stackoverflow     python               cs95   Third Mate               5635               369908
4  stackoverflow        xml  Dimitre Novatchev      Captain               5142               239598
5  stackoverflow        xml          Jon Skeet   Chief Mate              35587              1398836
6  stackoverflow        xml           kjhughes  Second Mate               2954               104539
7  stackoverflow        xml        Michael Kay   Third Mate               8226               154024

Parsing multiple XML attribute values and appending to a dataframe/list

Answers (1)

Related Questions