Python XML to records

Question

I'm trying to traverse a nested XML file in which I'm only interested in the text of certain elements. The XML itself contains "Row" elements, which indicate that values can appear multiple times. The goal is to transform this into database records. The XML looks like this:



    
        AttributeName
        B31BEF954E05B473A8D3A1B63B29F91E
        TECHCOLUMNNAME
        
        26. August 2010 10:16:10 MESZ
        23. November 2017 20:13:37 MEZ
        Administrator
        False
        
            ID
            ID
            Number
            
            None
            None
            FACT_TABLE_NAME
            
                ApplySimple("nvl(#0, -2)";TECHCOLUMNNAME)
                Manual
                
                    FACT_TABLE_NAME
                
            
            Unknown
        
        
            DESC
            DESC
            Number
            
            None
            None
            FACT_TABLE_NAME
            
                TECHCOLUMNNAME
                Manual
                
                    FACT_TABLE_NAME
                
            
            False
        
        
            TABLE_PK
            One to Many
            FACT_TABLE_NAME
            \Schema Objects\Attributes\FACT_TABLE_NAME\Star Attributes\_technical
        
        
            
                DESC
            
            
                DESC
            
            Locked
            True
            True

When I write this as a procedural script (not too pythonic, I guess), everything works as desired, but I have to write the same code over and over again:

from lxml import etree
xml = "starattribute_multi.xml"
elem = etree.parse(xml).find("ListPropertiesAttribute")

l=[]
for r in elem.find(".//Row"):

    if r.tag == 'Name':
        _name = r.text

    elif r.tag == "Id":
        _id = r.text

    elif r.getchildren():

        for r1 in r:

            if r1.tag == "AttributeFormName":
                _attr_form = r1.text

            elif r1.tag == "AttributeFormType":
                _attr_form_type = r1.text

            elif r1.tag == "AttributeFormReportSort":
                _form_repsort = r1.text

            elif r1.tag == "AttributeFormBrowseSort":
                _form_browsesort = r1.text

            elif r1.tag == "AttributeLookUpTable":
                _attr_lutable = r1.text

            elif r1.getchildren():

                for r2 in r1:

                    if r2.tag == "SchemaExpression":
                        _schema_expr = r2.text

                    elif r2.tag == "MappingMethod":
                        _schema_mapping = r2.text

                    elif r2.getchildren():

                        for r3 in r2:

                            if r3.tag == "SchemaCandidateTable":
                                _schema_table = r3.text

                            l.append((_name,_id,_attr_form,_attr_form_type,_form_repsort,_form_browsesort,_attr_lutable,_schema_expr,_schema_mapping,_schema_table))

This works fine and gives me the list of tuples I want. The output looks like this:

[('AttributeName',
  'B31BEF954E05B473A8D3A1B63B29F91E',
  'ID',
  'Number',
  'None',
  'None',
  'FACT_TABLE_NAME',
  'ApplySimple("nvl(#0, -2)";TECHCOLUMNNAME)',
  'Manual',
  'FACT_TABLE_NAME'),
 ('AttributeName',
  'B31BEF954E05B473A8D3A1B63B29F91E',
  'DESC',
  'Number',
  'None',
  'None',
  'FACT_TABLE_NAME',
  'TECHCOLUMNNAME',
  'Manual',
  'FACT_TABLE_NAME')]

Now I want to formalize this a bit, to remove the repetitive code and to be able to process other, similar-but-not-identical XML files. My idea was to write a function that checks for the desired tags, passed in as a search tuple, and collects the values it finds in a dictionary keyed by tag name, so that I can identify them later.

My function looks like this:

def traverse3(xmlelement,searchelements,dictreturn):
    _d=dict()
    for row in xmlelement:
        if row.getchildren():
            traverse3(row,searchelements,_d)
        else:
            dictreturn[row.tag]=row.text
        dictreturn.update(_d)
    return dictreturn

The intended usage was then:

from lxml import etree
root = etree.parse("some.xml")
elem = root.find("ListPropertiesAttribute")  # same starting element as in the first script
l = []
tags = ('Name', 'Id', 'AttributeFormName', 'AttributeFormType', 'AttributeFormReportSort', 'AttributeFormBrowseSort', 'AttributeLookUpTable', 'SchemaExpression', 'MappingMethod','SchemaCandidateTable')
d = {}
l.append(traverse3(elem,tags,d))

I only get the "last" record back, presumably because I failed to create a new dict somewhere, or to return it earlier, or because of something else I'm missing.

[{'Name': 'AttributeName',
  'Id': 'B31BEF954E05B473A8D3A1B63B29F91E',
  'Description': 'TECHCOLUMNNAME',
  'AttributeFormName': 'DESC',
  'AttributeFormType': 'Number',
  'AttributeFormReportSort': 'None',
  'AttributeFormBrowseSort': 'None',
  'AttributeLookUpTable': 'FACT_TABLE_NAME',
  'SchemaExpression': 'TECHCOLUMNNAME',
  'MappingMethod': 'Manual',
  'SchemaCandidateTable': 'FACT_TABLE_NAME'}]
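
The overwrite can be reproduced without my full file. Below is a minimal sketch, assuming a cut-down document in which the wrapper tags `AttributeForm` and `Expression` are hypothetical stand-ins (only the leaf tag names are taken from my real XML). The helper name `flatten` is also just for illustration. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies; the element calls used here behave the same way in both.

```python
import xml.etree.ElementTree as ET

# Cut-down sample. The wrapper tags "AttributeForm" and "Expression" are
# hypothetical stand-ins; only the leaf tag names come from the real file.
SAMPLE = """
<Row>
  <Name>AttributeName</Name>
  <AttributeForm>
    <AttributeFormName>ID</AttributeFormName>
    <Expression><MappingMethod>Manual</MappingMethod></Expression>
  </AttributeForm>
  <AttributeForm>
    <AttributeFormName>DESC</AttributeFormName>
    <Expression><MappingMethod>Manual</MappingMethod></Expression>
  </AttributeForm>
</Row>
"""


def flatten(element, result):
    """Collect leaf tag -> text into ONE shared dict (same idea as traverse3)."""
    for child in element:
        if len(child):      # child has sub-elements: recurse into it
            flatten(child, result)
        else:               # leaf: a repeated tag replaces the earlier value
            result[child.tag] = child.text
    return result


record = flatten(ET.fromstring(SAMPLE), {})
print(record)
# -> {'Name': 'AttributeName', 'AttributeFormName': 'DESC', 'MappingMethod': 'Manual'}
```

Because both forms write to the same keys of the same dict, the value `ID` is replaced by `DESC` as soon as the second form is visited.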

After adding some prints, I can see that my desired record (the one for the ID form) exists during the recursive calls, but it then gets overwritten by the similar record for the DESC form, which I want as well, of course. I also tried shrinking my search-tag list as I go, to get some kind of exit criterion, but every attempt at this (or at moving the returns around) ended with a "NoneType is not iterable" error.
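
To make the target behaviour concrete, here is a rough sketch of what I think the recursion would need to do: copy the values found so far for every branch instead of sharing one dict, and emit a record only at the innermost level. Everything in it is an assumption on my side: the function name `iter_records` is made up, the wrapper tags `AttributeForm`, `Expression` and `Tables` in the sample are hypothetical stand-ins for my real structure, the tag list is shortened, and it uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down sample; wrapper tag names are stand-ins.
SAMPLE = """
<Row>
  <Name>AttributeName</Name>
  <Id>B31BEF954E05B473A8D3A1B63B29F91E</Id>
  <AttributeForm>
    <AttributeFormName>ID</AttributeFormName>
    <Expression>
      <SchemaExpression>ApplySimple("nvl(#0, -2)";TECHCOLUMNNAME)</SchemaExpression>
      <Tables><SchemaCandidateTable>FACT_TABLE_NAME</SchemaCandidateTable></Tables>
    </Expression>
  </AttributeForm>
  <AttributeForm>
    <AttributeFormName>DESC</AttributeFormName>
    <Expression>
      <SchemaExpression>TECHCOLUMNNAME</SchemaExpression>
      <Tables><SchemaCandidateTable>FACT_TABLE_NAME</SchemaCandidateTable></Tables>
    </Expression>
  </AttributeForm>
</Row>
"""


def iter_records(element, wanted, inherited=None):
    """Yield one dict per innermost branch, carrying parent values downwards."""
    context = dict(inherited or {})  # fresh copy, so sibling branches never share state
    branches = []
    for child in element:
        if len(child):               # has sub-elements: visit later as its own branch
            branches.append(child)
        elif child.tag in wanted:    # wanted leaf: remember its text
            context[child.tag] = child.text
    if not branches:
        # Innermost level: emit only if every wanted tag was seen on the way down.
        if all(tag in context for tag in wanted):
            yield context
        return
    for branch in branches:
        yield from iter_records(branch, wanted, context)


WANTED = ("Name", "Id", "AttributeFormName", "SchemaExpression", "SchemaCandidateTable")
rows = [tuple(rec[tag] for tag in WANTED)
        for rec in iter_records(ET.fromstring(SAMPLE), WANTED)]
print(rows)
```

On this sample it yields two tuples, one for the ID form and one for the DESC form. Branches that never see all wanted tags are skipped, which is the only exit criterion in the sketch; I have not checked how it behaves on documents where the repeating groups sit at different depths.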

I would really appreciate some ideas/directions.

Apologies in advance for this epic question/example.
