Reputation: 31
Here is a small section of an xml file. I would like to create a database from this with each tag unique columns names and non-duplicated data.
Tried using lxml and the best I have been able to do so far is to create a dataframe that results in something like this:
"
SRCSGT
DATE 11112017
AGENCY Department of Veterans Affairs
OFFICE Canandaigua VAMC
LOCATION Department of Veterans Affairs Medical Center
ZIP 14424
etc, etc, "
<?xml version="1.0" encoding="UTF-8"?>
<NOTICES>
<SRCSGT>
<DATE>11112017</DATE>
<AGENCY><![CDATA[Department of Veterans Affairs]]></AGENCY>
<OFFICE><![CDATA[Canandaigua VAMC]]></OFFICE>
<LOCATION><![CDATA[Department of Veterans Affairs Medical Center]]></LOCATION>
<ZIP>14424</ZIP>
<CLASSCOD>H</CLASSCOD>
<NAICS>238210</NAICS>
<OFFADD><![CDATA[Department of Veterans Affairs;400 Fort Hill Ave.;Canandaigua NY 14424]]></OFFADD>
<SUBJECT><![CDATA[H--3 YEAR TESTING/MAINTENANCE OF ELECTRICAL EQUIPMENT AT THE SYRACUSE VA MEDICAL CENTER AND THE ROME COMMUNITY BASED OUTPATIENT CLINIC ]]></SUBJECT>
<SOLNBR><![CDATA[9069]]></SOLNBR>
<RESPDATE>11172017</RESPDATE>
<ARCHDATE>12172017</ARCHDATE>
<CONTACT><![CDATA[COiyiyS, JUhhiuN<a href="mailto:[email protected]">CONTRACT SPECIALIST</a>]]></CONTACT>
<DESC><![CDATA[This is a Sources Sought Notice. (a) The Government does not intend to award a contract on the basis of this Sources Sought or to otherwise pay for the information solicited.(b) Although "proposal," "offeror," contractor, and "offeror" may be used in this sources sought notice, any response will be treated as information only. It shall not be used as a proposal.Attachment(s) if applicable. ]]></DESC>
<LINK><![CDATA[https://www.fbo.gov/spg/VA/CaVAMC532/CaVAMC532/9069/listing.html]]></LINK>
<EMAIL>
<ADDRESS><![CDATA[[email protected]]]></ADDRESS>
<DESC><![CDATA[CONTRACT SPECIALIST]]></DESC>
</EMAIL>
<SETASIDE>N/A</SETASIDE>
<RECOVERY_ACT>N</RECOVERY_ACT>
<DOCUMENT_PACKAGES>
<PACKAGE><![CDATA[Attachment]]></PACKAGE>
</DOCUMENT_PACKAGES>
</SRCSGT>
</NOTICES>
from lxml import etree as et
import pandas as pd
trees = et.parse('test.xml') #get xml file
root = trees.getroot() #get to root of file
tags = [] #list for holding all tags
datas = [] #list for holding all data in tags
for child in root: #root is a list of all elements in the xml file
#print(child.tag)
tt = child.tag #reads xml tag
tags.append(tt)
datas.append(child.text) #read xml tag data
for c in child.findall('./'): # ./ finds children
tt1 = c.tag
tags.append(str(tt1))
datas.append(c.text)
for i in c.findall('./'): #each child node loads a new list of elements
tt2 = i.tag
tags.append(str(tt1)+ '_' + str(tt2))
datas.append(i.text)
for j in i.findall('./'):
tt3 = j.tag
tags.append(str(tt1)+ '_' + str(tt2) + '_' + str(tt3))
datas.append(j.text)
for k in j.findall('./'):
tt4 = k.tag
tags.append(str(tt1)+ '_' + str(tt2) + '_' + str(tt3) + '_' + str(tt4))
datas.append(k.text)
df = pd.DataFrame({"tags": tags,"values": datas})
The desired solution is something like this
date agency office
1/1/10 A1 O1
1/1/10 A2 O2
1/1/10 A3 O3
So basically the tags should turn into column headers and must be populated. The column names should not be repeated so I can create a standard database table.
Upvotes: 1
Views: 2879
Reputation: 107652
Consider nested xpath loops where first you loop through every <SCRSGT>
nodes and then extract all SCRSGT's children using an inner dictionary that iteratively appends to a list for DataFrame
call:
from lxml import etree as et
import pandas as pd
trees = et.parse('test.xml')
d = []
for srcsgt in trees.xpath('//SRCSGT'): # ITERATE THROUGH ROOT'S CHILDREN
inner = {}
for elem in srcsgt.xpath('//*'): # ITERATE THROUGH ROOT'S DESCENDANTS PER CHILD
if len(elem.text.strip()) > 0: # KEEP ONLY NODES WITH NON-ZERO LENGTH TEXT
inner[elem.tag] = elem.text
d.append(inner)
df = pd.DataFrame(d)
Output
print(df)
# ADDRESS AGENCY ARCHDATE CLASSCOD \
# 0 [email protected] Department of Veterans Affairs 12172017 H
# CONTACT DATE \
# 0 COiyiyS, JUhhiuN<a href="mailto:[email protected]... 11112017
# DESC LINK \
# 0 CONTRACT SPECIALIST https://www.fbo.gov/spg/VA/CaVAMC532/CaVAMC532...
# LOCATION NAICS \
# 0 Department of Veterans Affairs Medical Center 238210
# OFFADD OFFICE \
# 0 Department of Veterans Affairs;400 Fort Hill A... Canandaigua VAMC
# PACKAGE RECOVERY_ACT RESPDATE SETASIDE SOLNBR \
# 0 Attachment N 11172017 N/A 9069
# SUBJECT ZIP
# 0 H--3 YEAR TESTING/MAINTENANCE OF ELECTRICAL EQ... 14424
Upvotes: 1