Reputation: 25
I'm trying to parse an XML file in Python for a project.
I want the code to go through the XML and grab information for each Procedure. This information will be returned as a Python dictionary.
Specifically, I will traverse down through each Procedure element and get the names and types of the variables under each of its Data# elements.
Currently, my code is as below.
The issue is that Data2 is not of the right object type, so I can't traverse into the Variable layer.
I don't understand why I can't keep using getElementsByTagName to go down through each layer.
In the full code I'll be doing this for each Data#, and I should expect some of them to be empty or to have no nodes specified for a Procedure. The code should handle that (I'm not sure how to handle the case where there is nothing, other than checking if Data2Element). It's fine if the suggested solution uses another methodology.
Hence the question is: how should I handle empty nodes in an XML document in Python?
Note: I have no control over the file format. I have the 'standard' Python 3.3 modules, which include xml.dom and xml.etree; additionally I have Beautiful Soup (but no lxml). I cannot install lxml or anything else that's not already installed. I'm happy to switch to one of the other installed modules if that's needed for a solution.
filename = 'TestProc.xml'
from xml.dom import minidom
xmldoc = minidom.parse(filename)
procedureList = xmldoc.getElementsByTagName('Procedure')
varName=[]
varType=[]
for procElement in procedureList:
    Data2 = procElement.getElementsByTagName('Data2')
    varElements = Data2.getElementsByTagName('Variable')
    for varElemTmp in varElements:
        varName.append(varElemTmp.getAttribute('name'))
        varType.append(varElemTmp.getAttribute('type'))
Where TestProc.xml is the following.
<?xml version="1.0" encoding="utf-8"?>
<ProcedureSet xmlns:xs="htt//www.w3.org/2001/XMLSchema">
  <GlobalCode>
    <CodeBlock id="Code1">
    </CodeBlock>
    <CodeBlock id="Code2">
    </CodeBlock>
    <CodeBlock id="Code3">
    </CodeBlock>
  </GlobalCode>
  <Procedures>
    <Procedure id="Proc1" displayToUser="false" expectedType="Type1">
      <Description>Description1.</Description>
      <Data1 />
      <Data2 />
      <Data3 />
      <Data4 />
      <MainCode id="main">
        Junk1
      </MainCode>
    </Procedure>
    <Procedure id="Proc2" displayToUser="false" expectedType="Type2">
      <Description>Description2.</Description>
      <Data1 />
      <Data2>
        <Variable name="Var1" type="bool" causesChange="false">
          <description>Description3</description>
        </Variable>
      </Data2>
      <Data3>
        <Variable name="Var2" type="bool" causesChange="false">
          <description>Description4</description>
        </Variable>
        <Variable name="Var3" type="int" causesChange="false">
          <description>Description5</description>
        </Variable>
      </Data3>
      <Data4>
        <Variable name="Var4" type="link" />
        <Variable name="Var5" type="link" />
      </Data4>
      <MainCode id="main">
        Junk2
      </MainCode>
    </Procedure>
  </Procedures>
</ProcedureSet>
Upvotes: 1
Views: 1975
Reputation: 168716
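Data2 is a list of elements, not a single element: getElementsByTagName always returns a NodeList, even when only one element matches, and a NodeList has no getElementsByTagName method of its own. You could modify your code like so: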
for procElement in procedureList:
    ListOfData2 = procElement.getElementsByTagName('Data2')
    for Data2 in ListOfData2:
        varElements = Data2.getElementsByTagName('Variable')
        for varElemTmp in varElements:
            varName.append(varElemTmp.getAttribute('name'))
            varType.append(varElemTmp.getAttribute('type'))
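Since the question asks for a dictionary and for empty nodes to be handled, here is a minimal sketch of the same nested-loop idea that collects the results per Procedure. The function name data2_variables and the trimmed-down SAMPLE string are my own placeholders, not part of your file; I'm assuming the real document has the same layout. An empty <Data2 /> simply yields an empty NodeList, so the inner loop is skipped and no explicit check is needed.

```python
from xml.dom import minidom

# Trimmed-down stand-in for TestProc.xml (assumption: same element layout).
SAMPLE = """<ProcedureSet>
  <Procedures>
    <Procedure id="Proc1">
      <Data2 />
    </Procedure>
    <Procedure id="Proc2">
      <Data2>
        <Variable name="Var1" type="bool" />
      </Data2>
    </Procedure>
  </Procedures>
</ProcedureSet>"""


def data2_variables(xml_text):
    """Map each Procedure id to a list of (name, type) pairs found under Data2.

    Hypothetical helper, written only to illustrate the looping pattern.
    """
    doc = minidom.parseString(xml_text)
    result = {}
    for proc in doc.getElementsByTagName('Procedure'):
        pairs = []
        for data2 in proc.getElementsByTagName('Data2'):
            # For <Data2 /> this NodeList is empty, so the loop body never runs.
            for var in data2.getElementsByTagName('Variable'):
                pairs.append((var.getAttribute('name'),
                              var.getAttribute('type')))
        result[proc.getAttribute('id')] = pairs
    return result


print(data2_variables(SAMPLE))
```

A Procedure with nothing under Data2 ends up with an empty list rather than raising an error, which is usually easier to work with downstream than None.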
If you do switch to ElementTree, you can save yourself some looping by using its (limited) XPath syntax:
filename = 'TestProc.xml'
import xml.etree.ElementTree as ET
xmldoc = ET.parse(filename)
# findall returns an empty list when nothing matches, so an empty <Data2 /> needs no special handling
variables = xmldoc.findall(".//Procedure/Data2/Variable")
varName = [e.get('name') for e in variables]
varType = [e.get('type') for e in variables]
print(varName, varType)
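If you need this for every Data# element and want the result keyed by Procedure, something along these lines might work. It is only a sketch: procedures_to_dict, DATA_TAGS, and the shortened SAMPLE string are names I made up for illustration, and I'm assuming the Data elements are always called Data1 through Data4 as in your example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for TestProc.xml (assumption: same element layout).
SAMPLE = """<ProcedureSet>
  <Procedures>
    <Procedure id="Proc1">
      <Data1 />
      <Data2 />
    </Procedure>
    <Procedure id="Proc2">
      <Data1 />
      <Data2>
        <Variable name="Var1" type="bool" />
      </Data2>
      <Data3>
        <Variable name="Var2" type="bool" />
        <Variable name="Var3" type="int" />
      </Data3>
    </Procedure>
  </Procedures>
</ProcedureSet>"""

# Assumed tag names, taken from the example file in the question.
DATA_TAGS = ('Data1', 'Data2', 'Data3', 'Data4')


def procedures_to_dict(xml_text):
    """Return {procedure id: {data tag: [(name, type), ...]}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for proc in root.findall('.//Procedure'):
        per_data = {}
        for tag in DATA_TAGS:
            # findall gives [] both when the tag is absent and when it
            # has no Variable children, so both cases look the same here.
            per_data[tag] = [(v.get('name'), v.get('type'))
                             for v in proc.findall(tag + '/Variable')]
        result[proc.get('id')] = per_data
    return result


info = procedures_to_dict(SAMPLE)
print(info['Proc2']['Data3'])
```

Note that this treats a missing Data# element and an empty one identically. If you need to tell them apart, proc.find(tag) returns None only when the element is absent.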
Upvotes: 1