dataframe to hierarchical xml

Question

Read csv to dataframe and then convert that to xml using lxml library

This is my first time handling xml and it appears that there is partial success. Any help will be highly appreciated.

CSV File used to create dataframe:

Parent,Element,Text,Attribute
,TXLife,"
    ",{'Version': '2.25.00'}
TXLife,UserAuthRequest,"
        ",{}
UserAuthRequest,UserLoginName,*****,{}
UserAuthRequest,UserPswd,"
            ",{}
UserPswd,CryptType,None,{}
UserPswd,Pswd,****,{}
TXLife,TXLifeRequest,"
        ",{'PrimaryObjectID': 'Policy_1'}
TXLifeRequest,TransRefGUID,706D67C1-CC4D-11CF-91FB444554540000,{}
TXLifeRequest,TransType,Holding Change,{'tc': '502'}
TXLifeRequest,TransExeDate,2006-11-19,{}
TXLifeRequest,TransExeTime,13:15:33-07:00,{}
TXLifeRequest,ChangeSubType,"
            ",{}
ChangeSubType,ChangeTC,Change Participant,{'tc': '9'}
TXLifeRequest,OLifE,"
            ",{}
OLifE,Holding,"
                ",{'id': 'Policy_1'}
Holding,HoldingTypeCode,Policy,{'tc': '2'}
Holding,Policy,"
                    ",{}
Policy,PolNumber,1234567,{}
Policy,LineOfBusiness,Annuity,{'tc': '2'}
Policy,Annuity,,{}
OLifE,Party,"
                ",{'id': 'Beneficiary_1'}
Party,PartyTypeCode,Organization,{'tc': '2'}
Party,FullName,The Smith Trust,{}
Party,Organization,"
                    ",{}
Organization,OrgForm,Trust,{'tc': '16'}
Organization,DBA,The Smith Trust,{}
OLifE,Relation,"
                ","{'id': 'Relation_1', 'OriginatingObjectID': 'Policy_1', 'RelatedObjectID': 'Beneficiary_1'}"
Relation,OriginatingObjectType,Holding,{'tc': '4'}
Relation,RelatedObjectType,Party,{'tc': '6'}
Relation,RelationRoleCode,Primary Beneficiary,{'tc': '34'}
Relation,BeneficiaryDesignation,Named,{'tc': '1'}

import lxml.etree as etree
import pandas as pd
import json

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv') .fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("



Actual results:



  
    *****
    
            
    None
    xxxxxx
  
  
    706D67C1-CC4D-11CF-91FB444554540000
    Holding Change
    11/19/2006
    13:15:33-07:00
    
            
    Change Participant
    
            
    
                
    Policy
    
                    
    1234567
    Annuity
    NA
    
                
    Organization
    The Smith Trust
    
                    
    Trust
    The Smith Trust
    
                
    Holding
    Party
    Primary Beneficiary
    Named
  




Desired Results:



    
        *****
        
            None
            ****
        
    
    
        706D67C1-CC4D-11CF-91FB444554540000
        Holding Change
        2006-11-19
        13:15:33-07:00
        
            Change Participant
        
        
            
                Policy
                
                    1234567
                    Annuity
                    
                
            
            
                Organization
                The Smith Trust
                
                    Trust
                    The Smith Trust
                
            
            
                Holding
                Party
                Primary Beneficiary
                Named
            
        
    




How can I get the hierarchical results as showsn above?

Tim · Accepted Answer

You've made a great start! Thought it would be easiest to go through your code bit-by-bit and explain where it needs tweaking, and suggest some improvements:

Reading and cleaning the Data

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("



The .apply works fine, but there's also a .str.replace() method you can use, which would be a bit neater and clearer (the .str lets you treat the values of a column as string types and act on them accordingly).

Adding the root

# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()


This is all good!

Looping over the rows

for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:


Since you're retrieving all the columns anyway, you don't need to index into dfc to select them, so you can take that part out:

for prnt, elem, txt, attr in dfc[1:].values:


This works fine, but there are built-in methods for iterating over items in a DataFrame, and we could use itertuples(). This returns a NamedTuple for each row, which includes the index (basically the row number) as the first item in the tuple, so we need to adjust for that:

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():


Setting up the variables

    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # list(root) = root.getchildren()
    children = [item for item in str(list(root)).split(' ')]
    rootstring = str(root).split(' ')[1][1:].values:


It's a good trick to replace the single quotes with double quotes earlier so we can use json to turn the attributes into a dictionary.
Each Element has a .tag attribute which we can use to get the name, which is what we want here:

children = [item.tag for item in root]
rootstring = root.tag


list(root) or root.getchildren() would both give us a list of the child elements of root, but we can also loop through them using for ... in with root like this.

Adding the element to the tree

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == str(root).split(' ')[1]:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif not prnt == rootstring and prnt in children:
        child = etree.SubElement(parent, elem, attrib).text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib).text = txt



str(root).split(' ')[1] is exactly what we set rootstring to be above, so we can use that instead
Since we already checked if prnt == rootstring in the first if statement, if we've reached the first elif, we know it can't be equal so we don't need to check it again
When we're creating the child, we have two assignments at once... which somehow creates the child with its text successfully(!), but it means child is set to text and not the new SubElement. It would be better to do this in two steps.
When we're looking for the parent, we're currently creating a list of lists (split() returns a list), so it won't work. We want the item tag instead.


Making all these changes gives us:

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == rootstring:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif prnt in children:
        child = etree.SubElement(parent, elem, attrib)
        child.text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [item.tag for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib)
        child.text = txt


But there's a couple of problems here.

The first part (if statement) is fine.

In the second part (the first elif statement), we check if the parent of the new element is one of root's children. If it is, we add the new element as a child of parent. parent is definitely one of root's children, but we haven't actually checked if it's the correct one. It's just the last thing we added to root. Fortunately, because our CSV has all the elements in order, this is the correct one, but it would be better to be more explicit with this.

In the third part (second elif), it's good to check if prnt already exists further down the tree. But currently, if prnt doesn't exist, we're jusr adding the element to parent, which isn't its actual parent! And if prnt does exist, we're not adding the element at all (so we'd need an else clause here).

Solution

Thankfully, there's an easy way out: we can use .find() to find the prnt element, wherever it is in the tree, and then add the new element on there. This makes the whole thing a lot shorter as well!

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # Find parent element
    if prnt == root.tag:
        parent = root
    else:
        parent = root.find(".//" + prnt)
    child = etree.SubElement(parent, elem, attrib)
    child.text = txt


The .// in root.find(".//" + prnt) means it will search for anywhere in the tree for the matching element tag (read more here: https://lxml.de/tutorial.html#elementpath).



Final script

import lxml.etree as etree
import pandas as pd
import json

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna("NA")
dfc['Attribute'] = dfc['Attribute'].str.replace("'", '"').apply(lambda s: json.loads(s))

# Add the root element for xml
root = etree.Element(dfc['Element'][0], dfc['Attribute'][0])

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
    # Fix text
    text = txt.strip()
    if not text:
        text = None
    # Find parent element
    if prnt == root.tag:
        parent = root
    else:
        parent = root.find(".//" + prnt)
    # Create element
    child = etree.SubElement(parent, elem, attr)
    child.text = text

xml_string = etree.tostring(root, pretty_print=True).decode().replace(">NA<", "><")
print(xml_string)


I made a couple more changes:


I moved the json.loads bit for the attribute dictionaries up to when we're changing the quotation marks, and just added it on the end using apply. We need it there so that the dictionary is ready to go when we create the root element.
There are some issues with getting the pretty print to work properly, which is what the "Fix text" part is for (see this Stack Overflow question for the problem I was having).
It would be neatest to have .fillna("") (filling with an empty string), but if we do that, we end up with  instead of  (this is legal XML - if you have an element with no text or subelements, you can just do the closing tag). But to get it to come out as we wanted, we need it to have some 'content' so that the opening tag is created. So I left it as .fillna("NA") and then right at the end, manually replace that in the output string.


It's also good to be aware that this script makes (at least) four assumptions about the input data:


That parent elements are created before any of their children (i.e. that they occur further up in the CSV file)
That element names are unique (or at least, that any duplicate names don't have any children, so that we're never doing .find() where there could be more than one match; .find() always returns the first match)
That there aren't any text values of 'NA' that you want to preserve in the final XML (they'll also get removed when we remove the spurious 'NA' text from the Annuity element)
That text consisting only of whitespace doesn't need to be preserved

dataframe to hierarchical xml

Answers (1)

Reading and cleaning the Data

Adding the root

Looping over the rows

Setting up the variables

Adding the element to the tree

Solution

Final script

Related Questions