DECROMAX
DECROMAX

Reputation: 305

Pandas: Parse xml column and split into separate rows

I have a dataframe that looks like this (1 sample row)

    Title   Date    Supervisor  Nightshift  XML
    NaN 27/10/2021  Michael Myres   No    <?xml version="1.0" encoding="utf-8"?><Repeate...

The dataframe has one column that an contains nested XML (below)

    <?xml version="1.0" encoding="utf-8"?>
<RepeaterData>
    <Version />
    <Items>
        <Item>
            <TestReference type="System.String">**1203**</TestReference>
            <Discipline type="System.String">**Bodyshop**</Discipline>
            <WON type="System.String">**1234**</WON>
            <PurchaseOrder type="System.String">**1234**</PurchaseOrder>
            <LabourType type="System.String">**On Tools**</LabourType>
            <WorkType type="System.String">**Turnaround**</WorkType>
            <TestEngineer type="System.String">**Me**</TestEngineer>
            <_x0034_4508821-eb2d-45be-b408-5db4df886890 type="System.String">&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;utf-8&amp;quot;?&amp;gt;&amp;lt;RepeaterData&amp;gt;&amp;lt;Version /&amp;gt;&amp;lt;Items&amp;gt;&amp;lt;Item&amp;gt;&amp;lt;control_EmployeeName type=&amp;quot;System.String&amp;quot;&amp;gt;**Elliot**&amp;lt;/control_EmployeeName&amp;gt;&amp;lt;control_NHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;**1**&amp;lt;/control_NHours&amp;gt;&amp;lt;control_PHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;**2**&amp;lt;/control_PHours&amp;gt;&amp;lt;control_DHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;**3**&amp;lt;/control_DHours&amp;gt;&amp;lt;/Item&amp;gt;&amp;lt;Item&amp;gt;&amp;lt;control_EmployeeName type=&amp;quot;System.String&amp;quot;&amp;gt;**Ryan** &amp;lt;/control_EmployeeName&amp;gt;&amp;lt;control_NHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;**1**&amp;lt;/control_NHours&amp;gt;&amp;lt;control_PHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;**2**&amp;lt;/control_PHours&amp;gt;&amp;lt;control_DHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;**4**&amp;lt;/control_DHours&amp;gt;&amp;lt;/Item&amp;gt;&amp;lt;Item&amp;gt;&amp;lt;control_EmployeeName type=&amp;quot;System.String&amp;quot;&amp;gt;**Elliot**&amp;lt;/control_EmployeeName&amp;gt;&amp;lt;control_NHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;**1**&amp;lt;/control_NHours&amp;gt;&amp;lt;control_PHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;**1**&amp;lt;/control_PHours&amp;gt;&amp;lt;control_DHours type=&amp;quot;System.Int32&amp;quot;&amp;gt;1&amp;lt;/control_DHours&amp;gt;&amp;lt;/Item&amp;gt;&amp;lt;/Items&amp;gt;&amp;lt;/RepeaterData&amp;gt;</_x0034_4508821-eb2d-45be-b408-5db4df886890>
            <f91faab4-2d9d-4c1f-a40c-63c6a3d71572 type="System.String">&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;utf-8&amp;quot;?&amp;gt;&amp;lt;RepeaterData&amp;gt;&amp;lt;Version /&amp;gt;&amp;lt;Items&amp;gt;&amp;lt;Item&amp;gt;&amp;lt;NonProductiveTime type=&amp;quot;System.String&amp;quot;&amp;gt;Additional duties&amp;lt;/NonProductiveTime&amp;gt;&amp;lt;NonProductiveTimeMinutes type=&amp;quot;System.Double&amp;quot;&amp;gt;**15**&amp;lt;/NonProductiveTimeMinutes&amp;gt;&amp;lt;/Item&amp;gt;&amp;lt;/Items&amp;gt;&amp;lt;/RepeaterData&amp;gt;</f91faab4-2d9d-4c1f-a40c-63c6a3d71572>
        </Item>
    </Items>
</RepeaterData>

I would like to parse the XML column and split the xml ** values ** into separate rows

Expected result

       Date      Supervisor Nightshift  TestReference Discipline   WON  PurchaseOrder LabourType    WorkType TestEngineer EmployeeName  Nhours  Phours  Dhours
0 2021-10-27  Michael Myres         No           1203   Bodyshop  1234           1234   On Tools  Turnaround           Me       Elliot       1       2       3
1 2021-10-27  Michael Myres         No           1203   Bodyshop  1234           1234   On Tools  Turnaround           Me         Ryan       1       2       4
2 2021-10-27  Michael Myres         No           1203   Bodyshop  1234           1234   On Tools  Turnaround           Me         John       1       1       1

I have tried solution below but it just yields

    Version Items   Version Items   Version Items   Version Items   Version Items   Version Items
    None    None    None    None    None    None    None    None    None    None    None    None

import pandas as pd 
import xml.etree.ElementTree as ET

df = pd.read_csv('SharePointDataPull.csv')

xml_data = df['XML'][0] # test row 
root = ET.XML(xml_data)  # Parse XML

for i, child in enumerate(root):
    data.append([subchild.text for subchild in child])
    cols.append(child.tag)

df1 = pd.DataFrame(data).T  # Write in DF and transpose it
df1.columns = cols  # Update column names
df1

Thanks for your help!

Upvotes: 0

Views: 615

Answers (1)

Joshua Voskamp
Joshua Voskamp

Reputation: 2054

Would you confirm that you're effectively trying to construct the following DataFrame from the 0-indexed 7th tag in each row's XML string?

  EmployeeName  Nhours  Phours  Dhours
0       Elliot       1       2       3
1         Ryan       1       2       4
2         John       1       1       1

And that the list of columns you're looking for the "main" dataframe to contain is as below (SUBFRAME subject to your naming preference)? Or is that list of columns variable for each row?

['Date', 'Supervisor', 'Nightshift', 'TestReference', 'Discipline', 'WON', 'PurchaseOrder', 'LabourType', 'WorkType', 'TestEngineer', 'SUBFRAME']

Snippet below tracking what I have so far:

import html

# change from string of XML data to XML-parsed object
df.XML = df.XML.apply(ET.XML).apply(lambda root: root[1][0])

Upvotes: 1

Related Questions