Reputation: 81
I have a large data frame which I have sampled:
import pandas as pd
df = pd.DataFrame({'case': [1,1,1,2,2],
'IssueDatetime': ['2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00','2010-01-01 09:00:00','2010-01-01 09:00:00'],
'endtime':['2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00'],
'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH','NORTHEAST COAST','NORTHEAST COAST'],
'forecastTime': ['2010-01-01 09:00:00','2010-01-01 15:00:00','2010-01-01 19:00:00','2010-01-01 09:00:00','2010-01-01 12:00:00'],
'forecast_Dir':[150,180,45,45,45],
'windSpeed':[20,90,35,45,15]})
Each case has an endtime associated with it. Using pandas, how can I duplicate the last row of every case and set the forecastTime of that duplicated row to the case's endtime? Desired result:
df_new = pd.DataFrame({'case': [1,1,1,1,2,2,2],
'IssueDatetime': ['2010-01-01 09:00:00','2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00','2010-01-01 09:00:00','2010-01-01 09:00:00','2010-01-01 09:00:00'],
'endtime':['2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00'],
'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH','NORTHEAST COAST','NORTHEAST COAST','NORTHEAST COAST'],
'forecastTime': ['2010-01-01 09:00:00','2010-01-01 15:00:00','2010-01-01 19:00:00','2010-01-03 03:00:00','2010-01-01 09:00:00','2010-01-01 12:00:00','2010-01-03 03:00:00'],
'forecast_Dir':[150,180,45,45,45,45,45],
'windSpeed':[20,90,35,35,45,15,15]})
I need to apply this method for the ~13000 cases I have in my original dataframe.
Upvotes: 0
Views: 91
Reputation: 862691
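Use DataFrame.drop_duplicates with keep='last' to select the last row of each case, overwrite its forecastTime column with endtime, append the result to the original with concat, and finally sort by index with a stable sort so that each copy lands directly after the row it duplicates: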
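# Keep the last row of each case and overwrite forecastTime with endtime
df1 = df.drop_duplicates('case', keep='last').assign(forecastTime = lambda x: x['endtime'])
# mergesort is stable, so each copy stays right after the original row with the same index
df2 = pd.concat([df, df1]).sort_index(kind='mergesort').reset_index(drop=True)
print(df2)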
case IssueDatetime endtime \
0 1 2010-01-01 09:00:00 2010-01-03 03:00:00
1 1 2010-01-01 09:00:00 2010-01-03 03:00:00
2 1 2010-01-01 09:00:00 2010-01-03 03:00:00
3 1 2010-01-01 09:00:00 2010-01-03 03:00:00
4 2 2010-01-01 09:00:00 2010-01-03 03:00:00
5 2 2010-01-01 09:00:00 2010-01-03 03:00:00
6 2 2010-01-01 09:00:00 2010-01-03 03:00:00
Regions forecastTime forecast_Dir \
0 EAST COAST-CAPE ST FRANCIS AND SOUTH 2010-01-01 09:00:00 150
1 EAST COAST-CAPE ST FRANCIS AND SOUTH 2010-01-01 15:00:00 180
2 EAST COAST-CAPE ST FRANCIS AND SOUTH 2010-01-01 19:00:00 45
3 EAST COAST-CAPE ST FRANCIS AND SOUTH 2010-01-03 03:00:00 45
4 NORTHEAST COAST 2010-01-01 09:00:00 45
5 NORTHEAST COAST 2010-01-01 12:00:00 45
6 NORTHEAST COAST 2010-01-03 03:00:00 45
windSpeed
0 20
1 90
2 35
3 35
4 45
5 15
6 15
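The same steps can be sketched as a small reusable function and checked on a trimmed-down version of the sample frame. This is only an illustrative sketch: the helper name `append_end_rows` and its parameters are hypothetical, and only the columns needed for the check are kept.

```python
import pandas as pd

def append_end_rows(df, key="case", time_col="forecastTime", end_col="endtime"):
    """Duplicate the last row of each group; on the copy, set time_col to end_col.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Last occurrence of each key, with the time column overwritten by the end time.
    last = df.drop_duplicates(key, keep="last").assign(
        **{time_col: lambda x: x[end_col]}
    )
    # The copies keep their original index labels, so a stable sort on the
    # index places each copy directly after the row it duplicates.
    return (
        pd.concat([df, last])
        .sort_index(kind="mergesort")
        .reset_index(drop=True)
    )

# Trimmed-down sample: same values as the question, fewer columns.
df = pd.DataFrame({
    "case": [1, 1, 1, 2, 2],
    "endtime": ["2010-01-03 03:00:00"] * 5,
    "forecastTime": [
        "2010-01-01 09:00:00", "2010-01-01 15:00:00", "2010-01-01 19:00:00",
        "2010-01-01 09:00:00", "2010-01-01 12:00:00",
    ],
    "windSpeed": [20, 90, 35, 45, 15],
})

out = append_end_rows(df)
print(out)
```

Because the work is done by `drop_duplicates`, `concat`, and one sort rather than a Python loop over cases, the same call should be reasonable for the ~13000 cases mentioned in the question.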
Upvotes: 1
Reputation: 2811
One option is to use groupby with tail(1) to select the last row of each case and pd.concat to append those rows to the original:
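s = df.groupby('case').tail(1).copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
s['forecastTime'] = s['endtime']
# The default sort is not guaranteed to be stable; mergesort keeps each appended row after the existing rows of its case
df_new = pd.concat([df, s], ignore_index=True).sort_values('case', kind='mergesort').reset_index(drop=True)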
Upvotes: 1
Reputation: 2417
You could first build a dataframe containing only the last row of each case and then concatenate it with the original:
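# Note: GroupBy.last() takes the last non-null value per column, which matches the last row only when no values are missing
added = df.groupby(['case'], as_index=False).last().assign(forecastTime=lambda subdf: subdf.endtime)
# A stable sort keeps each added row after the existing rows of its case
df_new = pd.concat([df, added]).sort_values(by=['case'], kind='mergesort').reset_index(drop=True)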
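One caveat worth checking on real data: by default `GroupBy.last()` returns the last non-null value of each column, whereas `groupby(...).tail(1)` returns the literal last row. The sample data in the question has no missing values, so both agree there. The small frame below is made up purely to show where they can differ.

```python
import numpy as np
import pandas as pd

# Made-up data: the last row of case 1 has a missing windSpeed.
df = pd.DataFrame({
    "case": [1, 1, 2],
    "windSpeed": [20.0, np.nan, 15.0],
})

# last() skips the NaN and falls back to the previous non-null value.
by_last = df.groupby("case", as_index=False).last()

# tail(1) keeps the actual last row, NaN included.
by_tail = df.groupby("case").tail(1)

print(by_last)
print(by_tail)
```

If the duplicated row must mirror the final row exactly, including any missing values, `tail(1)` or `drop_duplicates(..., keep='last')` is the safer choice.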
Upvotes: 1