Reputation: 634
I would like to reformat an Excel file using Pandas.
The Excel file contains a list of IDs; for each ID, several Operations are performed on different dates and on different machines. These data are logged operation by operation, and I want to reformat them ID by ID.
I wrote the code below (simplified). It works, but it is really inefficient: on my real Excel file (~16 MB, 15 columns x 20,000 rows), it takes about 2 to 3 hours to run...
# -*- coding: utf-8 -*-
import pandas as pd
from collections import OrderedDict

data = pd.read_excel('Exemple.xlsx')
IDlist = data.ID.unique().tolist()

for ID in IDlist:
    tempData = OrderedDict()
    tempData['ID'] = ID
    for OP in data[data['ID'] == ID]['Operation'].tolist():
        tempData[str(OP) + '_Date'] = data[(data['ID'] == ID) & (data['Operation'] == OP)]['Date'].iloc[0].date()
        tempData[str(OP) + '_Machine'] = data[(data['ID'] == ID) & (data['Operation'] == OP)]['Machine'].iloc[0]
    if 'outputData' not in locals():
        outputData = pd.DataFrame(tempData, index=[0])
    else:
        outputData = outputData.append(tempData, ignore_index=True)

writer = pd.ExcelWriter('outputExemple.xlsx')
outputData.to_excel(writer, 'sheet', index=False)
writer.save()
Exemple.xlsx looks like this (shown as a CSV since that will be easier for you to import):
ID;Operation;Date;Machine
ID1;10;05/01/2018;Machine1
ID1;20;06/01/2018;Machine2
ID1;30;10/01/2018;Machine3
ID1;40;11/01/2018;Machine4
ID1;50;12/01/2018;Machine5
ID2;10;10/01/2018;Machine1
ID2;20;14/01/2018;Machine2
ID2;30;17/01/2018;Machine3
ID2;50;20/01/2018;Machine5
ID3;10;15/01/2018;Machine1
ID3;30;16/01/2018;Machine3
ID3;50;17/01/2018;Machine5
outputExemple.xlsx, reformatted ID by ID (still shown as a CSV):
ID;10_Date;10_Machine;20_Date;20_Machine;30_Date;30_Machine;40_Date;40_Machine;50_Date;50_Machine
ID1;2018-01-05;Machine1;2018-01-06;Machine2;2018-01-10;Machine3;2018-01-11;Machine4;2018-01-12;Machine5
ID2;2018-01-10;Machine1;2018-01-14;Machine2;2018-01-17;Machine3;;;2018-01-20;Machine5
ID3;2018-01-15;Machine1;;;2018-01-16;Machine3;;;2018-01-17;Machine5
To try to make it faster, I thought about using a double index, since the combination of 'ID' and 'Operation' is unique. But I couldn't manage it, and I don't know whether it would actually be faster...
data = data.set_index(['ID', 'Operation'])
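As a sketch of what that double index would give (indexing once up front, then looking each row up by its (ID, Operation) pair instead of filtering the whole frame every time; the toy frame below is illustrative, not my real data):

```python
import pandas as pd

# Toy frame standing in for the real ~20,000-row file
# (assumption from above: (ID, Operation) pairs are unique)
data = pd.DataFrame({
    'ID': ['ID1', 'ID1', 'ID2'],
    'Operation': [10, 20, 10],
    'Date': pd.to_datetime(['2018-01-05', '2018-01-06', '2018-01-10']),
    'Machine': ['Machine1', 'Machine2', 'Machine1'],
})

# Index and sort once; afterwards .loc is an indexed lookup instead of
# two full-column boolean scans per cell
data = data.set_index(['ID', 'Operation']).sort_index()
machine = data.loc[('ID1', 20), 'Machine']
date = data.loc[('ID1', 20), 'Date'].date()
```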
Any thoughts?
Upvotes: 1
Views: 73
Reputation: 107567
Consider pivot_table with some wrangling of column names, and no looping.
Data
from io import StringIO
import pandas as pd
txt = '''ID;Operation;Date;Machine
ID1;10;05/01/2018;Machine1
ID1;20;06/01/2018;Machine2
ID1;30;10/01/2018;Machine3
ID1;40;11/01/2018;Machine4
ID1;50;12/01/2018;Machine5
ID2;10;10/01/2018;Machine1
ID2;20;14/01/2018;Machine2
ID2;30;17/01/2018;Machine3
ID2;50;20/01/2018;Machine5
ID3;10;15/01/2018;Machine1
ID3;30;16/01/2018;Machine3
ID3;50;17/01/2018;Machine5'''
df = pd.read_csv(StringIO(txt), sep=";", parse_dates=[2], dayfirst=True)
Process (to handle your 15 real columns, extend the pivot values list and the currcols name-building accordingly)
pvt_df = df.pivot_table(index='ID', columns=['Operation'],
values=['Date', 'Machine'], aggfunc='max')
print(pvt_df)
# Date Machine
# Operation 10 20 30 40 50 10 20 30 40 50
# ID
# ID1 2018-01-05 2018-01-06 2018-01-10 2018-01-11 2018-01-12 Machine1 Machine2 Machine3 Machine4 Machine5
# ID2 2018-01-10 2018-01-14 2018-01-17 NaT 2018-01-20 Machine1 Machine2 Machine3 None Machine5
# ID3 2018-01-15 NaT 2018-01-16 NaT 2018-01-17 Machine1 None Machine3 None Machine5
# COLUMN WRANGLING
currcols = [o+'_Date' for o in pvt_df.columns.levels[1].astype('str')] + \
[m+'_Machine' for m in pvt_df.columns.levels[1].astype('str')]
# FLATTEN HIERARCHY
pvt_df.columns = pvt_df.columns.get_level_values(0)
# ASSIGN COLUMNS
pvt_df.columns = currcols
# RE-ORDER COLUMNS
pvt_df = pvt_df[sorted(currcols)]
# OUTPUT SEMI-COLON DELIMITED CSV
pvt_df.to_csv('Output.csv', sep=";")
# ID;10_Date;10_Machine;20_Date;20_Machine;30_Date;30_Machine;40_Date;40_Machine;50_Date;50_Machine
# ID1;2018-01-05;Machine1;2018-01-06;Machine2;2018-01-10;Machine3;2018-01-11;Machine4;2018-01-12;Machine5
# ID2;2018-01-10;Machine1;2018-01-14;Machine2;2018-01-17;Machine3;;;2018-01-20;Machine5
# ID3;2018-01-15;Machine1;;;2018-01-16;Machine3;;;2018-01-17;Machine5
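For completeness, the double-index idea from your question can produce the same wide layout via unstack. A sketch, assuming unique (ID, Operation) pairs (unstack raises on duplicates) and two-digit operation codes (so the plain string sort interleaves the columns correctly):

```python
from io import StringIO
import pandas as pd

txt = '''ID;Operation;Date;Machine
ID1;10;05/01/2018;Machine1
ID1;20;06/01/2018;Machine2
ID2;10;10/01/2018;Machine1'''
df = pd.read_csv(StringIO(txt), sep=';', parse_dates=['Date'], dayfirst=True)

# Unique (ID, Operation) pairs become the index; unstack moves Operation
# into the column axis, giving one (field, operation) pair per column
wide = df.set_index(['ID', 'Operation']).unstack('Operation')

# Flatten ('Date', 10) -> '10_Date', then sort to interleave Date/Machine
wide.columns = ['{}_{}'.format(op, field) for field, op in wide.columns]
wide = wide[sorted(wide.columns)]
```

Missing operations come out as NaT/NaN, which to_csv writes as empty fields, matching the desired output.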
Upvotes: 1