Reputation: 155
I'm trying to load some XML data into Python Pandas, I have never had to deal with XML before but unfortunately it's the only way I can get the data. I've looked around but the way this data is organised doesn't follow most examples so I'm struggling to translate tutorials into my use case.
Here is a truncated form of the data structure.
<?xml version="1.0" encoding="utf-8"?>
<Report xsi:schemaLocation="Trend_x0020_Report http://localhost/ReportServer_ION?%2FION%20Reports%2FION%20Report%20Packs%2FDefault%20Reports%2FTrend%20Report&rs%3AFormat=XML&rc%3ASchema=True" Name="Trend Report" textbox3="Report Data Table" textbox15="ID: 6e5dff70-8a91-48c0-b1db-29f6a43b73dc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="Trend_x0020_Report">
<matrix1 textbox28="Timestamp">
<TimestampGroup_Collection>
<TimestampGroup textbox30="01/08/2021 00:15:00">
<SourcesGroup_Collection>
<SourcesGroup textbox29="Datacenter.BankA">
<MeasurementGroup_Collection>
<MeasurementGroup textbox32="C1_L1_Amps ()">
<Cell textbox5="10.15" />
</MeasurementGroup>
</MeasurementGroup_Collection>
</SourcesGroup>
<SourcesGroup textbox29="Datacenter.BankB">
<MeasurementGroup_Collection>
<MeasurementGroup textbox32="C1_L1_Amps ()">
<Cell textbox5="10.05" />
</MeasurementGroup>
</MeasurementGroup_Collection>
</SourcesGroup>
</SourcesGroup_Collection>
</TimestampGroup>
<TimestampGroup textbox30="01/08/2021 00:30:00">
<SourcesGroup_Collection>
<SourcesGroup textbox29="Datacenter.BankA">
<MeasurementGroup_Collection>
<MeasurementGroup textbox32="C1_L1_Amps ()">
<Cell textbox5="11.51" />
</MeasurementGroup>
</MeasurementGroup_Collection>
</SourcesGroup>
<SourcesGroup textbox29="Datacenter.BankB">
<MeasurementGroup_Collection>
<MeasurementGroup textbox32="C1_L1_Amps ()">
<Cell textbox5="11.27" />
</MeasurementGroup>
</MeasurementGroup_Collection>
</SourcesGroup>
</SourcesGroup_Collection>
</TimestampGroup>
</TimestampGroup_Collection>
</matrix1>
</Report>
I'd like the data to appear as a Pandas DataFrame object like this:
An example of the desired DataFrame as a Python Dictionary:
import pandas as pd
data = {
"datetime": ["01/08/2021 00:15:00", "01/08/2021 00:30:00"],
"Datacenter.BankA": [10.15, 10.05],
"Datacenter.BankB": [11.51, 11.27]
}
df = pd.DataFrame(data)
If I load the XML as-is with
df = pd.DataFrame('/path/to/xml')
Pandas seems to do its best to interpret the data but stops at the single key
under Report
, rather than process the child keys under matrix1
.
I've tried various combinations of read_xml
options, such as elems_only=True
in order to process the child elements but I'm not getting anywhere.
I'm thinking that perhaps I need to create a style sheet to describe the data format but have no idea where to start - any advice welcome.
Thanks.
Upvotes: 0
Views: 165
Reputation: 23815
As far as I understand, the below is what you are looking for. Not exactly what you have posted.
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd
xml = '''<?xml version="1.0" encoding="utf-8"?>
<Report
xsi:schemaLocation="Trend_x0020_Report http://localhost/ReportServer_ION?%2FION%20Reports%2FION%20Report%20Packs%2FDefault%20Reports%2FTrend%20Report&rs%3AFormat=XML&rc%3ASchema=True"
Name="Trend Report" textbox3="Report Data Table" textbox15="ID: 6e5dff70-8a91-48c0-b1db-29f6a43b73dc"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="Trend_x0020_Report">
<matrix1 textbox28="Timestamp">
<TimestampGroup_Collection>
<TimestampGroup textbox30="01/08/2021 00:15:00">
<SourcesGroup_Collection>
<SourcesGroup textbox29="Datacenter.BankA">
<MeasurementGroup_Collection>
<MeasurementGroup textbox32="C1_L1_Amps ()">
<Cell textbox5="10.15" />
</MeasurementGroup>
</MeasurementGroup_Collection>
</SourcesGroup>
<SourcesGroup textbox29="Datacenter.BankB">
<MeasurementGroup_Collection>
<MeasurementGroup textbox32="C1_L1_Amps ()">
<Cell textbox5="10.05" />
</MeasurementGroup>
</MeasurementGroup_Collection>
</SourcesGroup>
</SourcesGroup_Collection>
</TimestampGroup>
<TimestampGroup textbox30="01/08/2021 00:30:00">
<SourcesGroup_Collection>
<SourcesGroup textbox29="Datacenter.BankA">
<MeasurementGroup_Collection>
<MeasurementGroup textbox32="C1_L1_Amps ()">
<Cell textbox5="11.51" />
</MeasurementGroup>
</MeasurementGroup_Collection>
</SourcesGroup>
<SourcesGroup textbox29="Datacenter.BankB">
<MeasurementGroup_Collection>
<MeasurementGroup textbox32="C1_L1_Amps ()">
<Cell textbox5="11.27" />
</MeasurementGroup>
</MeasurementGroup_Collection>
</SourcesGroup>
</SourcesGroup_Collection>
</TimestampGroup>
</TimestampGroup_Collection>
</matrix1>
</Report>'''
NS = "Trend_x0020_Report"
data = defaultdict(list)
root = ET.fromstring(xml)
timestamps = root.findall(f'.//{{{NS}}}TimestampGroup')
for ts in timestamps:
date = ts.attrib['textbox30']
data['datetime'].append(date)
for grp in ts.findall(f'.//{{{NS}}}SourcesGroup'):
bank_name = grp.attrib['textbox29']
cell = grp.find(f'.//{{{NS}}}Cell').attrib['textbox5']
data[bank_name].append(cell)
df = pd.DataFrame(data)
print(df)
output
datetime Datacenter.BankA Datacenter.BankB
0 01/08/2021 00:15:00 10.15 10.05
1 01/08/2021 00:30:00 11.51 11.27
Upvotes: 1