Python parsing xml to dataframe without disctinctly named TAGs

Question

I've been passed a bunch of XML files, however the tags don't have distinct names, rather, all the the record tags ends with rather than the opening name. It looks like this:



    
        
        [CDATA[MANUAL-GP-[][][02]01 JUL
         doing my narrative
    
    
        
        [CDATA[MANUAL-GP-[][][02]01 AUG
        another narative stuff

So you can see the root is , and each RECORD is named . That I can work with. Each of the embedded tags ends with though, and I think this is screwing up my code. Here is what I've got:

import xml.etree.cElementTree as et
import pandas as pd

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None


def main():
    """ main """
    parsed_xml = et.parse("Load.xml")
    dfcols = ['ROW','CASE_KEY','CASE_DESCRIPTION','CASE_NARRATIVE']
    df_xml = pd.DataFrame(columns=dfcols)

    for node in parsed_xml.getroot():
        ROW = node.attrib.get('ROW')
        CASE_KEY = node.find('CASE_KEY')
        CASE_DESCRIPTION = node.find('CASE_DESCRIPTION')
        CASE_NARRATIVE = node.find('CASE_NARRATIVE')

        df_xml = df_xml.append(
            pd.Series([ROW, getvalueofnode(CASE_KEY), getvalueofnode(CASE_DESCRIPTION),
                       getvalueofnode(CASE_NARRATIVE)], index=dfcols),
            ignore_index=True)

    print(df_xml)

main()

Here is the output from what I've run:

\c:\python\MonitorScripts\xmlcleaner\NewXML\Alerts>"c:\python\MonitorScripts\xmlcleaner\xmlc.py"
    ROW CASE_KEY CASE_DESCRIPTION CASE_NARRATIVE
0  None     None             None           None

I'm guessing it's the strangely named tags that's messing me about, however I have no option to change them to properly formatted tags.

Any suggestions?

QHarr · Accepted Answer

I used attribute = value css selectors with bs4

from bs4 import BeautifulSoup
import pandas as pd
doc = """


    
        
        [CDATA[MANUAL-GP-[][][02]01 JUL
         doing my narrative
    
    
        
        [CDATA[MANUAL-GP-[][][02]01 AUG
        another narative stuff
    

"""

soup = BeautifulSoup(doc, 'html.parser')

caseKeys = [case.text for case in soup.select('[name="CASE_KEY"]')]
caseDescriptions = [case.text for case in soup.select('[name="CASE_DESCRIPTION"]')]
caseNarratives = [case.text for case in soup.select('[name="CASE_NARRATIVE"]')]
results = list(zip(caseKeys, caseDescriptions, caseNarratives))
df = pd.DataFrame(results,columns = ['CaseKey','CaseDescription','CaseNarrative'])
print(df)

Python parsing xml to dataframe without disctinctly named TAGs

Answers (2)

Related Questions