MattB
MattB

Reputation: 155

Process child XML elements with Python Pandas read_xml

I'm trying to load some XML data into Python Pandas, I have never had to deal with XML before but unfortunately it's the only way I can get the data. I've looked around but the way this data is organised doesn't follow most examples so I'm struggling to translate tutorials into my use case.

Here is a truncated form of the data structure.

<?xml version="1.0" encoding="utf-8"?>
<Report xsi:schemaLocation="Trend_x0020_Report http://localhost/ReportServer_ION?%2FION%20Reports%2FION%20Report%20Packs%2FDefault%20Reports%2FTrend%20Report&amp;rs%3AFormat=XML&amp;rc%3ASchema=True" Name="Trend Report" textbox3="Report Data Table" textbox15="ID: 6e5dff70-8a91-48c0-b1db-29f6a43b73dc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="Trend_x0020_Report">
    <matrix1 textbox28="Timestamp">
        <TimestampGroup_Collection>
            <TimestampGroup textbox30="01/08/2021 00:15:00">
                <SourcesGroup_Collection>
                    <SourcesGroup textbox29="Datacenter.BankA">
                        <MeasurementGroup_Collection>
                            <MeasurementGroup textbox32="C1_L1_Amps ()">
                                <Cell textbox5="10.15" />
                            </MeasurementGroup>
                        </MeasurementGroup_Collection>
                    </SourcesGroup>
                    <SourcesGroup textbox29="Datacenter.BankB">
                        <MeasurementGroup_Collection>
                            <MeasurementGroup textbox32="C1_L1_Amps ()">
                                <Cell textbox5="10.05" />
                            </MeasurementGroup>
                        </MeasurementGroup_Collection>
                    </SourcesGroup>
                </SourcesGroup_Collection>
            </TimestampGroup>
            <TimestampGroup textbox30="01/08/2021 00:30:00">
                <SourcesGroup_Collection>
                    <SourcesGroup textbox29="Datacenter.BankA">
                        <MeasurementGroup_Collection>
                            <MeasurementGroup textbox32="C1_L1_Amps ()">
                                <Cell textbox5="11.51" />
                            </MeasurementGroup>
                        </MeasurementGroup_Collection>
                    </SourcesGroup>
                    <SourcesGroup textbox29="Datacenter.BankB">
                        <MeasurementGroup_Collection>
                            <MeasurementGroup textbox32="C1_L1_Amps ()">
                                <Cell textbox5="11.27" />
                            </MeasurementGroup>
                        </MeasurementGroup_Collection>
                    </SourcesGroup>
                </SourcesGroup_Collection>
            </TimestampGroup>
        </TimestampGroup_Collection>
    </matrix1>
</Report>

I'd like the data to appear as a Pandas DataFrame object like this:

Desired df

An example of the desired DataFrame as a Python Dictionary:

import pandas as pd
data = {
    "datetime": ["01/08/2021 00:15:00", "01/08/2021 00:30:00"],
    "Datacenter.BankA": [10.15, 10.05],
    "Datacenter.BankB": [11.51, 11.27]
}

df = pd.DataFrame(data)

If I load the XML as-is with

df = pd.DataFrame('/path/to/xml')

Pandas seems to do its best to interpret the data but stops at the single key under Report, rather than process the child keys under matrix1.

I've tried various combinations of read_xml options, such as elems_only=True in order to process the child elements but I'm not getting anywhere.

I'm thinking that perhaps I need to create a style sheet to describe the data format but have no idea where to start - any advice welcome.

Thanks.

Upvotes: 0

Views: 165

Answers (1)

balderman
balderman

Reputation: 23815

As far as I understand, the below is what you are looking for. Not exactly what you have posted.

import xml.etree.ElementTree as ET
from collections import defaultdict

import pandas as pd

xml = '''<?xml version="1.0" encoding="utf-8"?>
<Report 
        xsi:schemaLocation="Trend_x0020_Report http://localhost/ReportServer_ION?%2FION%20Reports%2FION%20Report%20Packs%2FDefault%20Reports%2FTrend%20Report&amp;rs%3AFormat=XML&amp;rc%3ASchema=True" 
        Name="Trend Report" textbox3="Report Data Table" textbox15="ID: 6e5dff70-8a91-48c0-b1db-29f6a43b73dc" 
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="Trend_x0020_Report">
    <matrix1 textbox28="Timestamp">
        <TimestampGroup_Collection>
            <TimestampGroup textbox30="01/08/2021 00:15:00">
                <SourcesGroup_Collection>
                    <SourcesGroup textbox29="Datacenter.BankA">
                        <MeasurementGroup_Collection>
                            <MeasurementGroup textbox32="C1_L1_Amps ()">
                                <Cell textbox5="10.15" />
                            </MeasurementGroup>
                        </MeasurementGroup_Collection>
                    </SourcesGroup>
                    <SourcesGroup textbox29="Datacenter.BankB">
                        <MeasurementGroup_Collection>
                            <MeasurementGroup textbox32="C1_L1_Amps ()">
                                <Cell textbox5="10.05" />
                            </MeasurementGroup>
                        </MeasurementGroup_Collection>
                    </SourcesGroup>
                </SourcesGroup_Collection>
            </TimestampGroup>
            <TimestampGroup textbox30="01/08/2021 00:30:00">
                <SourcesGroup_Collection>
                    <SourcesGroup textbox29="Datacenter.BankA">
                        <MeasurementGroup_Collection>
                            <MeasurementGroup textbox32="C1_L1_Amps ()">
                                <Cell textbox5="11.51" />
                            </MeasurementGroup>
                        </MeasurementGroup_Collection>
                    </SourcesGroup>
                    <SourcesGroup textbox29="Datacenter.BankB">
                        <MeasurementGroup_Collection>
                            <MeasurementGroup textbox32="C1_L1_Amps ()">
                                <Cell textbox5="11.27" />
                            </MeasurementGroup>
                        </MeasurementGroup_Collection>
                    </SourcesGroup>
                </SourcesGroup_Collection>
            </TimestampGroup>
        </TimestampGroup_Collection>
    </matrix1>
</Report>'''

NS = "Trend_x0020_Report"
data = defaultdict(list)
root = ET.fromstring(xml)
timestamps = root.findall(f'.//{{{NS}}}TimestampGroup')
for ts in timestamps:
    date = ts.attrib['textbox30']
    data['datetime'].append(date)
    for grp in ts.findall(f'.//{{{NS}}}SourcesGroup'):
        bank_name = grp.attrib['textbox29']
        cell = grp.find(f'.//{{{NS}}}Cell').attrib['textbox5']
        data[bank_name].append(cell)

df = pd.DataFrame(data)
print(df)

output

              datetime Datacenter.BankA Datacenter.BankB
0  01/08/2021 00:15:00            10.15            10.05
1  01/08/2021 00:30:00            11.51            11.27

Upvotes: 1

Related Questions