Reputation: 166
Read csv to dataframe and then convert that to xml using lxml library
This is my first time handling xml and it appears that there is partial success. Any help will be highly appreciated.
CSV File used to create dataframe:
Parent,Element,Text,Attribute
,TXLife,"
",{'Version': '2.25.00'}
TXLife,UserAuthRequest,"
",{}
UserAuthRequest,UserLoginName,*****,{}
UserAuthRequest,UserPswd,"
",{}
UserPswd,CryptType,None,{}
UserPswd,Pswd,****,{}
TXLife,TXLifeRequest,"
",{'PrimaryObjectID': 'Policy_1'}
TXLifeRequest,TransRefGUID,706D67C1-CC4D-11CF-91FB444554540000,{}
TXLifeRequest,TransType,Holding Change,{'tc': '502'}
TXLifeRequest,TransExeDate,2006-11-19,{}
TXLifeRequest,TransExeTime,13:15:33-07:00,{}
TXLifeRequest,ChangeSubType,"
",{}
ChangeSubType,ChangeTC,Change Participant,{'tc': '9'}
TXLifeRequest,OLifE,"
",{}
OLifE,Holding,"
",{'id': 'Policy_1'}
Holding,HoldingTypeCode,Policy,{'tc': '2'}
Holding,Policy,"
",{}
Policy,PolNumber,1234567,{}
Policy,LineOfBusiness,Annuity,{'tc': '2'}
Policy,Annuity,,{}
OLifE,Party,"
",{'id': 'Beneficiary_1'}
Party,PartyTypeCode,Organization,{'tc': '2'}
Party,FullName,The Smith Trust,{}
Party,Organization,"
",{}
Organization,OrgForm,Trust,{'tc': '16'}
Organization,DBA,The Smith Trust,{}
OLifE,Relation,"
","{'id': 'Relation_1', 'OriginatingObjectID': 'Policy_1', 'RelatedObjectID': 'Beneficiary_1'}"
Relation,OriginatingObjectType,Holding,{'tc': '4'}
Relation,RelatedObjectType,Party,{'tc': '6'}
Relation,RelationRoleCode,Primary Beneficiary,{'tc': '34'}
Relation,BeneficiaryDesignation,Named,{'tc': '1'}
import lxml.etree as etree
import pandas as pd
import json
# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv') .fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("<cyfunction")].fillna('')
dfc['Attribute'] = dfc['Attribute'].apply(lambda x: x.replace("'", '"'))
# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()
for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:
# Convert attributes to json (dictionary)
attrib = json.loads(attr)
# list(root) = root.getchildren()
children = [item for item in str(list(root)).split(' ')]
rootstring = str(root).split(' ')[1]
# If the parent is root then add the element as child (appaers to work?)
if prnt == str(root).split(' ')[1]:
parent = etree.SubElement(root, elem)
# If the parent is not root but is one of its children then add the elements to the parent
elif not prnt == rootstring and prnt in children:
child = etree.SubElement(parent, elem, attrib).text = txt
# # If the parent is not in root's descendents then add the childern to the parents
elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
child = etree.SubElement(parent, elem, attrib).text = txt
print(etree.tostring(tree, pretty_print=True).decode())
Actual results:
<TXLife> <UserAuthRequest> <UserLoginName>*****</UserLoginName> <UserPswd> </UserPswd> <CryptType>None</CryptType> <Pswd>xxxxxx</Pswd> </UserAuthRequest> <TXLifeRequest> <TransRefGUID>706D67C1-CC4D-11CF-91FB444554540000</TransRefGUID> <TransType tc="502">Holding Change</TransType> <TransExeDate>11/19/2006</TransExeDate> <TransExeTime>13:15:33-07:00</TransExeTime> <ChangeSubType> </ChangeSubType> <ChangeTC tc="9">Change Participant</ChangeTC> <OLifE> </OLifE> <Holding id="Policy_1"> </Holding> <HoldingTypeCode tc="2">Policy</HoldingTypeCode> <Policy> </Policy> <PolNumber>1234567</PolNumber> <LineOfBusiness tc="2">Annuity</LineOfBusiness> <Annuity>NA</Annuity> <Party id="Beneficiary_1"> </Party> <PartyTypeCode tc="2">Organization</PartyTypeCode> <FullName>The Smith Trust</FullName> <Organization> </Organization> <OrgForm tc="16">Trust</OrgForm> <DBA>The Smith Trust</DBA> <Relation OriginatingObjectID="Policy_1" RelatedObjectID="Beneficiary_1" id="Relation_1"> </Relation> <OriginatingObjectType tc="4">Holding</OriginatingObjectType> <RelatedObjectType tc="6">Party</RelatedObjectType> <RelationRoleCode tc="34">Primary Beneficiary</RelationRoleCode> <BeneficiaryDesignation tc="1">Named</BeneficiaryDesignation> </TXLifeRequest> </TXLife>
Desired Results:
<TXLife Version="2.25.00"> <UserAuthRequest> <UserLoginName>*****</UserLoginName> <UserPswd> <CryptType>None</CryptType> <Pswd>****</Pswd> </UserPswd> </UserAuthRequest> <TXLifeRequest PrimaryObjectID="Policy_1"> <TransRefGUID>706D67C1-CC4D-11CF-91FB444554540000</TransRefGUID> <TransType tc="502">Holding Change</TransType> <TransExeDate>2006-11-19</TransExeDate> <TransExeTime>13:15:33-07:00</TransExeTime> <ChangeSubType> <ChangeTC tc="9">Change Participant</ChangeTC> </ChangeSubType> <OLifE> <Holding id="Policy_1"> <HoldingTypeCode tc="2">Policy</HoldingTypeCode> <Policy> <PolNumber>1234567</PolNumber> <LineOfBusiness tc="2">Annuity</LineOfBusiness> <Annuity></Annuity> </Policy> </Holding> <Party id="Beneficiary_1"> <PartyTypeCode tc="2">Organization</PartyTypeCode> <FullName>The Smith Trust</FullName> <Organization> <OrgForm tc="16">Trust</OrgForm> <DBA>The Smith Trust</DBA> </Organization> </Party> <Relation id="Relation_1" OriginatingObjectID="Policy_1" RelatedObjectID="Beneficiary_1"> <OriginatingObjectType tc="4">Holding</OriginatingObjectType> <RelatedObjectType tc="6">Party</RelatedObjectType> <RelationRoleCode tc="34">Primary Beneficiary</RelationRoleCode> <BeneficiaryDesignation tc="1">Named</BeneficiaryDesignation> </Relation> </OLifE> </TXLifeRequest> </TXLife>
How can I get the hierarchical results as showsn above?
Upvotes: 1
Views: 596
Reputation: 2049
You've made a great start! Thought it would be easiest to go through your code bit-by-bit and explain where it needs tweaking, and suggest some improvements:
# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("<cyfunction")].fillna('')
dfc['Attribute'] = dfc['Attribute'].apply(lambda x: x.replace("'", '"'))
The .apply
works fine, but there's also a .str.replace()
method you can use, which would be a bit neater and clearer (the .str
lets you treat the values of a column as string types and act on them accordingly).
# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()
This is all good!
for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:
Since you're retrieving all the columns anyway, you don't need to index into dfc
to select them, so you can take that part out:
for prnt, elem, txt, attr in dfc[1:].values:
This works fine, but there are built-in methods for iterating over items in a DataFrame, and we could use itertuples()
. This returns a NamedTuple
for each row, which includes the index (basically the row number) as the first item in the tuple, so we need to adjust for that:
for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
# Convert attributes to json (dictionary)
attrib = json.loads(attr)
# list(root) = root.getchildren()
children = [item for item in str(list(root)).split(' ')]
rootstring = str(root).split(' ')[1][1:].values:
It's a good trick to replace the single quotes with double quotes earlier so we can use json
to turn the attributes into a dictionary.
Each Element
has a .tag
attribute which we can use to get the name, which is what we want here:
children = [item.tag for item in root]
rootstring = root.tag
list(root)
or root.getchildren()
would both give us a list of the child elements of root
, but we can also loop through them using for ... in
with root
like this.
# If the parent is root then add the element as child (appaers to work?)
if prnt == str(root).split(' ')[1]:
parent = etree.SubElement(root, elem)
# If the parent is not root but is one of its children then add the elements to the parent
elif not prnt == rootstring and prnt in children:
child = etree.SubElement(parent, elem, attrib).text = txt
# # If the parent is not in root's descendents then add the childern to the parents
elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
child = etree.SubElement(parent, elem, attrib).text = txt
str(root).split(' ')[1]
is exactly what we set rootstring
to be above, so we can use that insteadprnt == rootstring
in the first if
statement, if we've reached the first elif
, we know it can't be equal so we don't need to check it againchild
is set to text
and not the new SubElement
. It would be better to do this in two steps.split()
returns a list), so it won't work. We want the item tag instead.Making all these changes gives us:
# If the parent is root then add the element as child (appaers to work?)
if prnt == rootstring:
parent = etree.SubElement(root, elem)
# If the parent is not root but is one of its children then add the elements to the parent
elif prnt in children:
child = etree.SubElement(parent, elem, attrib)
child.text = txt
# # If the parent is not in root's descendents then add the childern to the parents
elif not prnt in [item.tag for item in root.iterdescendants()]:
child = etree.SubElement(parent, elem, attrib)
child.text = txt
But there's a couple of problems here.
The first part (if
statement) is fine.
In the second part (the first elif
statement), we check if the parent of the new element is one of root's children. If it is, we add the new element as a child of parent
. parent
is definitely one of root's children, but we haven't actually checked if it's the correct one. It's just the last thing we added to root
. Fortunately, because our CSV has all the elements in order, this is the correct one, but it would be better to be more explicit with this.
In the third part (second elif
), it's good to check if prnt
already exists further down the tree. But currently, if prnt
doesn't exist, we're jusr adding the element to parent
, which isn't its actual parent! And if prnt
does exist, we're not adding the element at all (so we'd need an else
clause here).
Thankfully, there's an easy way out: we can use .find()
to find the prnt
element, wherever it is in the tree, and then add the new element on there. This makes the whole thing a lot shorter as well!
for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
# Convert attributes to json (dictionary)
attrib = json.loads(attr)
# Find parent element
if prnt == root.tag:
parent = root
else:
parent = root.find(".//" + prnt)
child = etree.SubElement(parent, elem, attrib)
child.text = txt
The .//
in root.find(".//" + prnt)
means it will search for anywhere in the tree for the matching element tag (read more here: https://lxml.de/tutorial.html#elementpath).
import lxml.etree as etree
import pandas as pd
import json
# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna("NA")
dfc['Attribute'] = dfc['Attribute'].str.replace("'", '"').apply(lambda s: json.loads(s))
# Add the root element for xml
root = etree.Element(dfc['Element'][0], dfc['Attribute'][0])
for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
# Fix text
text = txt.strip()
if not text:
text = None
# Find parent element
if prnt == root.tag:
parent = root
else:
parent = root.find(".//" + prnt)
# Create element
child = etree.SubElement(parent, elem, attr)
child.text = text
xml_string = etree.tostring(root, pretty_print=True).decode().replace(">NA<", "><")
print(xml_string)
I made a couple more changes:
json.loads
bit for the attribute dictionaries up to when we're changing the quotation marks, and just added it on the end using apply
. We need it there so that the dictionary is ready to go when we create the root element..fillna("")
(filling with an empty string), but if we do that, we end up with </Annuity>
instead of <Annuity></Annuity>
(this is legal XML - if you have an element with no text or subelements, you can just do the closing tag). But to get it to come out as we wanted, we need it to have some 'content' so that the opening tag is created. So I left it as .fillna("NA")
and then right at the end, manually replace that in the output string.It's also good to be aware that this script makes (at least) four assumptions about the input data:
.find()
where there could be more than one match; .find()
always returns the first match)Annuity
element)Upvotes: 1