Reputation: 1005
I've been passed a bunch of XML files, however the tags don't have distinct names, rather, all the the record tags ends with rather than the opening name. It looks like this:
<?xml version='1.0' encoding='UTF-8' ?>
<RESULTS>
<ROW>
<COLUMN NAME="CASE_KEY"><![CDATA[1248999]]></COLUMN>
<COLUMN NAME="CASE_DESCRIPTION">[CDATA[MANUAL-GP-[][][02]01 JUL</COLUMN>
<COLUMN NAME="CASE_NARRATIVE"> doing my narrative</COLUMN>
</ROW>
<ROW>
<COLUMN NAME="CASE_KEY"><![CDATA[1248998]]></COLUMN>
<COLUMN NAME="CASE_DESCRIPTION">[CDATA[MANUAL-GP-[][][02]01 AUG</COLUMN>
<COLUMN NAME="CASE_NARRATIVE">another narative stuff</COLUMN>
</ROW>
</RESULTS>
So you can see the root is , and each RECORD is named . That I can work with. Each of the embedded tags ends with though, and I think this is screwing up my code. Here is what I've got:
import xml.etree.cElementTree as et
import pandas as pd
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse("Load.xml")
dfcols = ['ROW','CASE_KEY','CASE_DESCRIPTION','CASE_NARRATIVE']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
ROW = node.attrib.get('ROW')
CASE_KEY = node.find('CASE_KEY')
CASE_DESCRIPTION = node.find('CASE_DESCRIPTION')
CASE_NARRATIVE = node.find('CASE_NARRATIVE')
df_xml = df_xml.append(
pd.Series([ROW, getvalueofnode(CASE_KEY), getvalueofnode(CASE_DESCRIPTION),
getvalueofnode(CASE_NARRATIVE)], index=dfcols),
ignore_index=True)
print(df_xml)
main()
Here is the output from what I've run:
\c:\python\MonitorScripts\xmlcleaner\NewXML\Alerts>"c:\python\MonitorScripts\xmlcleaner\xmlc.py"
ROW CASE_KEY CASE_DESCRIPTION CASE_NARRATIVE
0 None None None None
I'm guessing it's the strangely named tags that's messing me about, however I have no option to change them to properly formatted tags.
Any suggestions?
Upvotes: 2
Views: 1466
Reputation: 14103
Another option using objectify
:
from lxml import objectify
xml = objectify.parse('file.xml')
root = xml.getroot()
data=[]
for i in range(len(root.getchildren())):
data.append([child.text for child in root.getchildren()[i].getchildren()])
df = pd.DataFrame(data)
df.columns = ['CASE_KEY', 'CASE_DESCRIPTION', 'CASE_NARRATIVE']
Upvotes: 1
Reputation: 84465
I used attribute = value css selectors with bs4
from bs4 import BeautifulSoup
import pandas as pd
doc = """
<?xml version='1.0' encoding='UTF-8' ?>
<RESULTS>
<ROW>
<COLUMN NAME="CASE_KEY"><![CDATA[1248999]]></COLUMN>
<COLUMN NAME="CASE_DESCRIPTION">[CDATA[MANUAL-GP-[][][02]01 JUL</COLUMN>
<COLUMN NAME="CASE_NARRATIVE"> doing my narrative</COLUMN>
</ROW>
<ROW>
<COLUMN NAME="CASE_KEY"><![CDATA[1248998]]></COLUMN>
<COLUMN NAME="CASE_DESCRIPTION">[CDATA[MANUAL-GP-[][][02]01 AUG</COLUMN>
<COLUMN NAME="CASE_NARRATIVE">another narative stuff</COLUMN>
</ROW>
</RESULTS>
"""
soup = BeautifulSoup(doc, 'html.parser')
caseKeys = [case.text for case in soup.select('[name="CASE_KEY"]')]
caseDescriptions = [case.text for case in soup.select('[name="CASE_DESCRIPTION"]')]
caseNarratives = [case.text for case in soup.select('[name="CASE_NARRATIVE"]')]
results = list(zip(caseKeys, caseDescriptions, caseNarratives))
df = pd.DataFrame(results,columns = ['CaseKey','CaseDescription','CaseNarrative'])
print(df)
Upvotes: 1