Reputation: 3
I'm currently trying to parse some XML (XML added at end of question) in Python to get the value of multiple attributes. The end goal is being able to flag, first if there is an attribute and then the value of that attribute. I then combine that with the column header to identify that column has an attribute (as not all will).
Imports are as follows:
#---import libraries
import xlwings as xw #---lib to do most of the excel steps
import pandas as pd #---lib to read csv file
import openpyxl as xl #---lib to convert csv to xls
import os #---lib to extract filename for input and folder movement
from datetime import datetime
import xml.etree.ElementTree as Etree
Using the below code I can see which fields have an attribute and then add the text 'custom' to the head.
for elem in tree.iter():
if bool(elem.attrib):
ea = str(elem.attrib)
B.update({elem.tag + " (Custom)": ea})
A.append(B) # appending B to list
else:
B.update({elem.tag: elem.attrib})
A.append(B) # appending B to list
The result when printing is as follows.
company department employee name job salary (Custom)
0 {} {} {} {} {} {'datatype': 'int'}
Now this works when there is only one, however I need to be able to achieve a similar result when there is multiple attributes, the example I am working on has two ("custom" and "datatype"). So I need to get both and then add them to what is ultimately the columns in my dataframe / list.
An example might look like:
Job [string] | Salary (custom)[int] |
---|---|
Cell 1 | Cell 2 |
The data below the headers isn't important to me, as from here I will pull only the headers into a new master excel document to create what will ultimately become a sort of report definition / data dictionary where I can parse the XML files of all the reports used across an organisation.
I am just unable to target the attributes I need. The attributes in question are 'Custom' where the value might be 'Yes' and 'datatype' where the values could be 'int', 'string' etc. Ideally I want the value of the custom attribute to be added to the column header with normal brackets () and the datatype value in square brackets []
XML file:
<?xml version="1.0" encoding="UTF-8"?>
<company>
<department>
<employee>
<name>
</name>
<job>
</job>
<salary custom = 'Yes'>
</salary>
<salary datatype = 'int'>
</salary>
</employee>
<employee>
<name>
</name>
<job>
</job>
<salary custom = 'Yes'>
</salary>
<salary datatype = 'int'>
</salary>
</employee>
<employee>
<name>
</name>
<job>
</job>
<salary custom = 'Yes'>
</salary>
<salary datatype = 'int'>
</salary>
</employee>
<employee>
<name>
</name>
<job>
</job>
<salary custom = 'Yes'>
</salary>
<salary datatype = 'int'>
</salary>
</employee>
</department>
<department>
<employee>
<name>
</name>
<job>
</job>
<salary custom = 'Yes'>
</salary>
<salary datatype = 'int'>
</salary>
</employee>
<employee>
<name>
</name>
<job>
</job>
<salary custom = 'Yes'>
</salary>
<salary datatype = 'int'>
</salary>
</employee>
<employee>
<name>
</name>
<job>
</job>
<salary custom = 'Yes'>
</salary>
<salary datatype = 'int'>
</salary>
</employee>
</department>
</company>
Upvotes: 0
Views: 156
Reputation: 107652
Consider XSLT, the special-purpose language designed to transform XML files, which is supported in pandas.read_xml()
using the default lxml
parser and stylesheet
argument.
Below XSLT will flatten content to employee
level, drawing down ancestor elements, and dynamically handling salary
element depending on attributes. Sample data includes a complete XML capturing top earners in pandas and xml StackOverflow tags and using seafaring ranks.
Do note: the brackets and parentheses are not included being disallowed symbols in XML node names. Change as needed after parsing in Python.
XML
<?xml version="1.0" encoding="UTF-8"?>
<company name="stackoverflow">
<department name="python">
<employee>
<name>Jezrael</name>
<job>Captain</job>
<salary custom="Yes">30463</salary>
<salary datatype="int">806160</salary>
</employee>
<employee>
<name>Ed Chum</name>
<job>Chief Mate</job>
<salary custom="Yes">3058</salary>
<salary datatype="int">369802</salary>
</employee>
<employee>
<name>Andy Hayden</name>
<job>Second Mate</job>
<salary custom="Yes">1959</salary>
<salary datatype="int">353578</salary>
</employee>
<employee>
<name>cs95</name>
<job>Third Mate</job>
<salary custom="Yes">5635</salary>
<salary datatype="int">369908</salary>
</employee>
</department>
<department name="xml">
<employee>
<name>Dimitre Novatchev</name>
<job>Captain</job>
<salary custom="Yes">5142</salary>
<salary datatype="int">239598</salary>
</employee>
<employee>
<name>Jon Skeet</name>
<job>Chief Mate</job>
<salary custom="Yes">35587</salary>
<salary datatype="int">1398836</salary>
</employee>
<employee>
<name>kjhughes</name>
<job>Second Mate</job>
<salary custom="Yes">2954</salary>
<salary datatype="int">104539</salary>
</employee>
<employee>
<name>Michael Kay</name>
<job>Third Mate</job>
<salary custom="Yes">8226</salary>
<salary datatype="int">154024</salary>
</employee>
</department>
</company>
XSLT (save as .xsl script, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- PARSE ONLY employee LEVEL ELEMENTS -->
<xsl:template match="/company">
<employees>
<xsl:apply-templates select="descendant::employee"/>
</employees>
</xsl:template>
<!-- RESTYLE employee ELEMENTS, DRAW DOWN ANCESTORS -->
<xsl:template match="employee">
<xsl:copy>
<company><xsl:value-of select="ancestor::company/@name"/></company>
<department><xsl:value-of select="ancestor::department/@name"/></department>
<xsl:copy-of select="name|job"/>
<xsl:apply-templates select="salary"/>
</xsl:copy>
</xsl:template>
<!-- DYNAMICALLY HANDLE salary ELEMENTS -->
<xsl:template match="salary">
<xsl:element name="{concat(name(), '_', name(@*), '_', @*)}">
<xsl:value-of select="text()"/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Python (requires third-party lxml
package for stylesheet
)
import pandas as pd
employees_df = pd.read_xml("myInput.xml", stylesheet="myScript.xsl")
print(employees_df)
company department name job salary_custom_Yes salary_datatype_int
0 stackoverflow python Jezrael Captain 30463 806160
1 stackoverflow python Ed Chum Chief Mate 3058 369802
2 stackoverflow python Andy Hayden Second Mate 1959 353578
3 stackoverflow python cs95 Third Mate 5635 369908
4 stackoverflow xml Dimitre Novatchev Captain 5142 239598
5 stackoverflow xml Jon Skeet Chief Mate 35587 1398836
6 stackoverflow xml kjhughes Second Mate 2954 104539
7 stackoverflow xml Michael Kay Third Mate 8226 154024
Upvotes: 1