Reputation: 305
I have a dataframe that looks like this (1 sample row)
Title Date Supervisor Nightshift XML
NaN 27/10/2021 Michael Myres No <?xml version="1.0" encoding="utf-8"?><Repeate...
The dataframe has one column that an contains nested XML (below)
<?xml version="1.0" encoding="utf-8"?>
<RepeaterData>
<Version />
<Items>
<Item>
<TestReference type="System.String">**1203**</TestReference>
<Discipline type="System.String">**Bodyshop**</Discipline>
<WON type="System.String">**1234**</WON>
<PurchaseOrder type="System.String">**1234**</PurchaseOrder>
<LabourType type="System.String">**On Tools**</LabourType>
<WorkType type="System.String">**Turnaround**</WorkType>
<TestEngineer type="System.String">**Me**</TestEngineer>
<_x0034_4508821-eb2d-45be-b408-5db4df886890 type="System.String">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;&lt;RepeaterData&gt;&lt;Version /&gt;&lt;Items&gt;&lt;Item&gt;&lt;control_EmployeeName type=&quot;System.String&quot;&gt;**Elliot**&lt;/control_EmployeeName&gt;&lt;control_NHours type=&quot;System.Int32&quot;&gt;**1**&lt;/control_NHours&gt;&lt;control_PHours type=&quot;System.Int32&quot;&gt;**2**&lt;/control_PHours&gt;&lt;control_DHours type=&quot;System.Int32&quot;&gt;**3**&lt;/control_DHours&gt;&lt;/Item&gt;&lt;Item&gt;&lt;control_EmployeeName type=&quot;System.String&quot;&gt;**Ryan** &lt;/control_EmployeeName&gt;&lt;control_NHours type=&quot;System.Int32&quot;&gt;**1**&lt;/control_NHours&gt;&lt;control_PHours type=&quot;System.Int32&quot;&gt;**2**&lt;/control_PHours&gt;&lt;control_DHours type=&quot;System.Int32&quot;&gt;**4**&lt;/control_DHours&gt;&lt;/Item&gt;&lt;Item&gt;&lt;control_EmployeeName type=&quot;System.String&quot;&gt;**Elliot**&lt;/control_EmployeeName&gt;&lt;control_NHours type=&quot;System.Int32&quot;&gt;**1**&lt;/control_NHours&gt;&lt;control_PHours type=&quot;System.Int32&quot;&gt;**1**&lt;/control_PHours&gt;&lt;control_DHours type=&quot;System.Int32&quot;&gt;1&lt;/control_DHours&gt;&lt;/Item&gt;&lt;/Items&gt;&lt;/RepeaterData&gt;</_x0034_4508821-eb2d-45be-b408-5db4df886890>
<f91faab4-2d9d-4c1f-a40c-63c6a3d71572 type="System.String">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;&lt;RepeaterData&gt;&lt;Version /&gt;&lt;Items&gt;&lt;Item&gt;&lt;NonProductiveTime type=&quot;System.String&quot;&gt;Additional duties&lt;/NonProductiveTime&gt;&lt;NonProductiveTimeMinutes type=&quot;System.Double&quot;&gt;**15**&lt;/NonProductiveTimeMinutes&gt;&lt;/Item&gt;&lt;/Items&gt;&lt;/RepeaterData&gt;</f91faab4-2d9d-4c1f-a40c-63c6a3d71572>
</Item>
</Items>
</RepeaterData>
I would like to parse the XML column and split the xml ** values ** into separate rows
Expected result
Date Supervisor Nightshift TestReference Discipline WON PurchaseOrder LabourType WorkType TestEngineer EmployeeName Nhours Phours Dhours
0 2021-10-27 Michael Myres No 1203 Bodyshop 1234 1234 On Tools Turnaround Me Elliot 1 2 3
1 2021-10-27 Michael Myres No 1203 Bodyshop 1234 1234 On Tools Turnaround Me Ryan 1 2 4
2 2021-10-27 Michael Myres No 1203 Bodyshop 1234 1234 On Tools Turnaround Me John 1 1 1
I have tried solution below but it just yields
Version Items Version Items Version Items Version Items Version Items Version Items
None None None None None None None None None None None None
import pandas as pd
import xml.etree.ElementTree as ET
df = pd.read_csv('SharePointDataPull.csv')
xml_data = df['XML'][0] # test row
root = ET.XML(xml_data) # Parse XML
for i, child in enumerate(root):
data.append([subchild.text for subchild in child])
cols.append(child.tag)
df1 = pd.DataFrame(data).T # Write in DF and transpose it
df1.columns = cols # Update column names
df1
Thanks for your help!
Upvotes: 0
Views: 615
Reputation: 2054
Would you confirm that you're effectively trying to construct the following DataFrame from the 0-indexed 7th tag in each row's XML string?
EmployeeName Nhours Phours Dhours
0 Elliot 1 2 3
1 Ryan 1 2 4
2 John 1 1 1
And that the list of columns you're looking for the "main" dataframe to contain is as below (SUBFRAME
subject to your naming preference)? Or is that list of columns variable for each row?
['Date', 'Supervisor', 'Nightshift', 'TestReference', 'Discipline', 'WON', 'PurchaseOrder', 'LabourType', 'WorkType', 'TestEngineer', 'SUBFRAME']
Snippet below tracking what I have so far:
import html
# change from string of XML data to XML-parsed object
df.XML = df.XML.apply(ET.XML).apply(lambda root: root[1][0])
Upvotes: 1