Reputation: 45
I'm trying to traverse a nested xml file structure where I'm only interested in certain element values/text. The xml itelf contains "Row" elements, which indicate that values can appear multiple times. The goal is to read/transform this into database records. The xml looks like this:
<CommandManagerResults>
<ListPropertiesAttribute>
<Row>
<Name>AttributeName</Name>
<Id>B31BEF954E05B473A8D3A1B63B29F91E</Id>
<Description>TECHCOLUMNNAME</Description>
<LongDescription/>
<CreationTime>26. August 2010 10:16:10 MESZ</CreationTime>
<ModificationTime>23. November 2017 20:13:37 MEZ</ModificationTime>
<Owner>Administrator</Owner>
<Hidden>False</Hidden>
<Row>
<AttributeFormName>ID</AttributeFormName>
<AttributeFormCategory>ID</AttributeFormCategory>
<AttributeFormType>Number</AttributeFormType>
<AttributeFormDescription/>
<AttributeFormReportSort>None</AttributeFormReportSort>
<AttributeFormBrowseSort>None</AttributeFormBrowseSort>
<AttributeLookUpTable>FACT_TABLE_NAME</AttributeLookUpTable>
<Row>
<SchemaExpression>ApplySimple("nvl(#0, -2)";TECHCOLUMNNAME)</SchemaExpression>
<MappingMethod>Manual</MappingMethod>
<Row>
<SchemaCandidateTable>FACT_TABLE_NAME</SchemaCandidateTable>
</Row>
</Row>
<Multilingual>Unknown</Multilingual>
</Row>
<Row>
<AttributeFormName>DESC</AttributeFormName>
<AttributeFormCategory>DESC</AttributeFormCategory>
<AttributeFormType>Number</AttributeFormType>
<AttributeFormDescription/>
<AttributeFormReportSort>None</AttributeFormReportSort>
<AttributeFormBrowseSort>None</AttributeFormBrowseSort>
<AttributeLookUpTable>FACT_TABLE_NAME</AttributeLookUpTable>
<Row>
<SchemaExpression>TECHCOLUMNNAME</SchemaExpression>
<MappingMethod>Manual</MappingMethod>
<Row>
<SchemaCandidateTable>FACT_TABLE_NAME</SchemaCandidateTable>
</Row>
</Row>
<Multilingual>False</Multilingual>
</Row>
<Row>
<AttributeChild>TABLE_PK</AttributeChild>
<AttributeChildRelationship>One to Many</AttributeChildRelationship>
<AttributeChildTable>FACT_TABLE_NAME</AttributeChildTable>
<Path>\Schema Objects\Attributes\FACT_TABLE_NAME\Star Attributes\_technical</Path>
</Row>
<Row>
<Row>
<AttributeBrowseDisplay>DESC</AttributeBrowseDisplay>
</Row>
<Row>
<ReportDisplayForm>DESC</ReportDisplayForm>
</Row>
<AttributeElementDisplay>Locked</AttributeElementDisplay>
<SecurityFliterToElementBrowsing>True</SecurityFliterToElementBrowsing>
<EnableElementCaching>True</EnableElementCaching>
</Row>
</Row>
</ListPropertiesAttribute>
When writing this as a procedural script (not too pythonic I guess) everything works as desired but I have to repetitively write the same code over and over again:
from lxml import etree
xml = "starattribute_multi.xml"
elem = etree.parse(xml).find("ListPropertiesAttribute")
l=[]
for r in elem.find(".//Row"):
if r.tag == 'Name':
_name = r.text
elif r.tag == "Id":
_id = r.text
elif r.getchildren():
for r1 in r:
if r1.tag == "AttributeFormName":
_attr_form = r1.text
elif r1.tag == "AttributeFormType":
_attr_form_type = r1.text
elif r1.tag == "AttributeFormReportSort":
_form_repsort = r1.text
elif r1.tag == "AttributeFormBrowseSort":
_form_browsesort = r1.text
elif r1.tag == "AttributeLookUpTable":
_attr_lutable = r1.text
elif r1.getchildren():
for r2 in r1:
if r2.tag == "SchemaExpression":
_schema_expr = r2.text
elif r2.tag == "MappingMethod":
_schema_mapping = r2.text
elif r2.getchildren():
for r3 in r2:
if r3.tag == "SchemaCandidateTable":
_schema_table = r3.text
l.append((_name,_id,_attr_form,_attr_form_type,_form_repsort,_form_browsesort,_attr_lutable,_schema_expr,_schema_mapping,_schema_table))
Everything is fine with this, I'm getting my desired list of tuples. Output looks like:
[('AttributeName',
'B31BEF954E05B473A8D3A1B63B29F91E',
'ID',
'Number',
'None',
'None',
'FACT_TABLE_NAME',
'ApplySimple("nvl(#0, -2)";TECHCOLUMNNAME)',
'Manual',
'FACT_TABLE_NAME'),
('AttributeName',
'B31BEF954E05B473A8D3A1B63B29F91E',
'DESC',
'Number',
'None',
'None',
'FACT_TABLE_NAME',
'TECHCOLUMNNAME',
'Manual',
'FACT_TABLE_NAME')]
Now I want to formalize this a bit, to remove my repetitive code and to allow me to process other, similar-but-not-the-same xmls. I thought of writing functions, which can check for my desired tags provided in a search tuple and want to use dictionaries, to later identify the values that had been found.
My function looks like this:
def traverse3(xmlelement,searchelements,dictreturn):
_d=dict()
for row in xmlelement:
if row.getchildren():
traverse3(row,searchelements,_d)
else:
dictreturn[row.tag]=row.text
dictreturn.update(_d)
return dictreturn
The intended usage was then:
from lxml import etree
root = etree.parse("some.xml")
l = []
tags = ('Name', 'Id', 'AttributeFormName', 'AttributeFormType', 'AttributeFormReportSort', 'AttributeFormBrowseSort', 'AttributeLookUpTable', 'SchemaExpression', 'MappingMethod','SchemaCandidateTable')
d = {}
l.append(traverse3(elem,tags,d))
I get only the "last" record back, which is surely because I missed to add a new dict somewhere or to return it earlier or whatever else I'm missing.
[{'Name': 'AttributeName',
'Id': 'B31BEF954E05B473A8D3A1B63B29F91E',
'Description': 'TECHCOLUMNNAME',
'AttributeFormName': 'DESC',
'AttributeFormType': 'Number',
'AttributeFormReportSort': 'None',
'AttributeFormBrowseSort': 'None',
'AttributeLookUpTable': 'FACT_TABLE_NAME',
'SchemaExpression': 'TECHCOLUMNNAME',
'MappingMethod': 'Manual',
'SchemaCandidateTable': 'FACT_TABLE_NAME']
After I added some prints, I can see that my desired record (the one with the ID form) was there during my recursive calls, but it gets overwritten with the other, somewhat similar record for DESC form - which I want as well, of course. I added some functionality where I tried to reduce my searchtag list to have some kind of exit criteria, but all attempts for doing this (or even move around the returns) ended with some "NoneType is not iterable".
I would really appreciate some ideas/directions.
Apologies for this epic question/example in advance.
Upvotes: 3
Views: 114
Reputation: 45
Not sure if answering my own question is deemed as good practice. I've found the solution meanwhile, I've reversed the processing by setting/starting at the deepest element I'm interested in and iterate over the ancestors of this element.
def reverslookup(xmlelement,searchtags):
d={}
d[xmlelement.tag] = xmlelement.text
for parent in xmlelement.iterancestors():
if parent.tag == "Row":
for elem in parent:
if elem.tag in searchtags:
d[elem.tag] = elem.text
return d
if __name__ == "__main__":
from lxml import etree
root = etree.parse("file.xml")
tags = ('Name', 'Id', 'AttributeFormName', 'AttributeFormType', 'AttributeFormReportSort', 'AttributeFormBrowseSort', 'AttributeLookUpTable', 'SchemaExpression', 'MappingMethod','SchemaCandidateTable')
l=[]
for start in root.findall(".//*/SchemaCandidateTable"):
l.append(reverslookup(start,tags))
This way I get the elements above without having to take care of potential repeating tags in the xml file. It gives me the desired list of dictionaries for each record:
Out[6]:
[{'SchemaCandidateTable': 'FACT_TABLE_NAME',
'SchemaExpression': 'ApplySimple("nvl(#0, -2)";TECHCOLUMNNAME)',
'MappingMethod': 'Manual',
'AttributeFormName': 'ID',
'AttributeFormType': 'Number',
'AttributeFormReportSort': 'None',
'AttributeFormBrowseSort': 'None',
'AttributeLookUpTable': 'FACT_TABLE_NAME',
'Name': 'AttributeName',
'Id': 'B31BEF954E05B473A8D3A1B63B29F91E'},
{'SchemaCandidateTable': 'FACT_TABLE_NAME',
'SchemaExpression': 'TECHCOLUMNNAME',
'MappingMethod': 'Manual',
'AttributeFormName': 'DESC',
'AttributeFormType': 'Number',
'AttributeFormReportSort': 'None',
'AttributeFormBrowseSort': 'None',
'AttributeLookUpTable': 'FACT_TABLE_NAME',
'Name': 'AttributeName',
'Id': 'B31BEF954E05B473A8D3A1B63B29F91E'}]
Upvotes: 1