Jordan Ford

Reputation: 81

Add extra row for each group in dataframe - pandas

I have a large DataFrame; here is a small sample of it:

import pandas as pd

df = pd.DataFrame({'case': [1,1,1,2,2],
                   'IssueDatetime': ['2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00','2010-01-01 09:00:00','2010-01-01 09:00:00'],
                   'endtime':['2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00'],
                   'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH','NORTHEAST COAST','NORTHEAST COAST'],
                   'forecastTime': ['2010-01-01 09:00:00','2010-01-01 15:00:00','2010-01-01 19:00:00','2010-01-01 09:00:00','2010-01-01 12:00:00'],
                   'forecast_Dir':[150,180,45,45,45],
                   'windSpeed':[20,90,35,45,15]})

Each case has an endtime associated with it. Using pandas, how can I duplicate the last row of every case and set the forecastTime of that new final row to the case's endtime? Desired result:

df_new = pd.DataFrame({'case': [1,1,1,1,2,2,2], 
                   'IssueDatetime': ['2010-01-01 09:00:00','2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00','2010-01-01 09:00:00','2010-01-01 09:00:00','2010-01-01 09:00:00'],
                   'endtime':['2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00'],
                   'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH','NORTHEAST COAST','NORTHEAST COAST','NORTHEAST COAST'],
                   'forecastTime': ['2010-01-01 09:00:00','2010-01-01 15:00:00','2010-01-01 19:00:00','2010-01-03 03:00:00','2010-01-01 09:00:00','2010-01-01 12:00:00','2010-01-03 03:00:00'],
                   'forecast_Dir':[150,180,45,45,45,45,45],
                   'windSpeed':[20,90,35,35,45,15,15]})

I need to apply this to all ~13000 cases in my original DataFrame.

Upvotes: 0

Views: 91

Answers (3)

jezrael

Reputation: 862691

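Use DataFrame.drop_duplicates with keep='last' to get the last row of each case, overwrite forecastTime with endtime, append the result to the original with concat, and finally sort by index. The copied rows keep their original index labels, so a stable sort (mergesort) places each copy directly after the row it was copied from: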

df1 = df.drop_duplicates('case', keep='last').assign(forecastTime = lambda x: x['endtime'])
df2 = pd.concat([df, df1]).sort_index(kind='mergesort').reset_index(drop=True)
print(df2)
   case        IssueDatetime              endtime  \
0     1  2010-01-01 09:00:00  2010-01-03 03:00:00   
1     1  2010-01-01 09:00:00  2010-01-03 03:00:00   
2     1  2010-01-01 09:00:00  2010-01-03 03:00:00   
3     1  2010-01-01 09:00:00  2010-01-03 03:00:00   
4     2  2010-01-01 09:00:00  2010-01-03 03:00:00   
5     2  2010-01-01 09:00:00  2010-01-03 03:00:00   
6     2  2010-01-01 09:00:00  2010-01-03 03:00:00   

                                Regions         forecastTime  forecast_Dir  \
0  EAST COAST-CAPE ST FRANCIS AND SOUTH  2010-01-01 09:00:00           150   
1  EAST COAST-CAPE ST FRANCIS AND SOUTH  2010-01-01 15:00:00           180   
2  EAST COAST-CAPE ST FRANCIS AND SOUTH  2010-01-01 19:00:00            45   
3  EAST COAST-CAPE ST FRANCIS AND SOUTH  2010-01-03 03:00:00            45   
4                       NORTHEAST COAST  2010-01-01 09:00:00            45   
5                       NORTHEAST COAST  2010-01-01 12:00:00            45   
6                       NORTHEAST COAST  2010-01-03 03:00:00            45   

   windSpeed  
0         20  
1         90  
2         35  
3         35  
4         45  
5         15  
6         15  
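
For the ~13000 cases mentioned in the question, the same steps could be wrapped in a small helper. This is only a sketch: the function name append_end_row and its parameters are made up for illustration, and the sample frame is a trimmed-down version of the one in the question. It resets the index first so that the stable index sort works even if the frame was filtered or sampled beforehand.

```python
import pandas as pd

def append_end_row(df, group_col="case", time_col="forecastTime", end_col="endtime"):
    """Copy the last row of each group and set its time_col to end_col.

    Hypothetical helper, not part of pandas.
    """
    df = df.reset_index(drop=True)  # unique, ordered index for the sort below
    last = df.drop_duplicates(group_col, keep="last")
    last = last.assign(**{time_col: last[end_col]})
    # A stable sort on the index puts each copy right after its source row.
    out = pd.concat([df, last]).sort_index(kind="mergesort")
    return out.reset_index(drop=True)

sample = pd.DataFrame({
    "case": [1, 1, 1, 2, 2],
    "endtime": ["2010-01-03 03:00:00"] * 5,
    "forecastTime": ["2010-01-01 09:00:00", "2010-01-01 15:00:00",
                     "2010-01-01 19:00:00", "2010-01-01 09:00:00",
                     "2010-01-01 12:00:00"],
    "windSpeed": [20, 90, 35, 45, 15],
})
result = append_end_row(sample)
print(result)
```

Because nothing here loops over groups in Python, the cost should grow roughly with the number of rows rather than the number of cases.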

Upvotes: 1

Terry

Reputation: 2811

One option is to select the last row of each case with groupby(...).tail(1) and append those rows to the original with pd.concat:

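s = df.groupby('case').tail(1).copy()  # copy, so the assignment below does not trigger SettingWithCopyWarning
s['forecastTime'] = s['endtime']

# a stable sort keeps the original row order within each case, with the new row last
df_new = pd.concat([df, s], ignore_index=True).sort_values('case', kind='mergesort').reset_index(drop=True)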
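
If the time columns are parsed as datetimes, the order can also be restored by sorting on case and forecastTime rather than relying on row position. A sketch on a trimmed-down version of the sample data, assuming that endtime is never earlier than any forecastTime within the same case (true for the sample, not verified for the full data):

```python
import pandas as pd

df = pd.DataFrame({
    "case": [1, 1, 1, 2, 2],
    "endtime": ["2010-01-03 03:00:00"] * 5,
    "forecastTime": ["2010-01-01 09:00:00", "2010-01-01 15:00:00",
                     "2010-01-01 19:00:00", "2010-01-01 09:00:00",
                     "2010-01-01 12:00:00"],
    "windSpeed": [20, 90, 35, 45, 15],
})

# Parse the strings so that sorting compares real timestamps.
for col in ["endtime", "forecastTime"]:
    df[col] = pd.to_datetime(df[col])

# Last row of each case, with forecastTime replaced by endtime.
s = df.groupby("case").tail(1).copy()
s["forecastTime"] = s["endtime"]

# Sorting by case, then time, puts the new row at the end of its case.
df_new = (
    pd.concat([df, s], ignore_index=True)
    .sort_values(["case", "forecastTime"])
    .reset_index(drop=True)
)
print(df_new)
```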

Upvotes: 1

Ayoub ZAROU

Reputation: 2417

You could first build a DataFrame that keeps only the last row of each case, overwrite its forecastTime, and then concatenate it with the original:

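added = df.groupby(['case'], as_index=False).last().assign(forecastTime=lambda subdf: subdf.endtime)
# a stable sort keeps each case's rows in their original order, with the added row last
df_new = pd.concat([df, added]).sort_values(by=['case'], kind='mergesort').reset_index(drop=True)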
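
One caveat with this approach: GroupBy.last() returns the last non-null value of each column by default, not the last row as a whole. The sample in the question has no missing values, so both give the same result there, but with gaps in the data they can differ. A small made-up frame (the name demo and its values are invented for illustration) shows the difference against groupby(...).tail(1):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "case": [1, 1],
    "endtime": ["2010-01-03 03:00:00"] * 2,
    "windSpeed": [20.0, np.nan],  # the last row of the case has a missing value
})

# last(): per column, the last value that is not null
via_last = demo.groupby("case", as_index=False).last()

# tail(1): the actual last row, missing values included
via_tail = demo.groupby("case").tail(1)

print(via_last["windSpeed"].tolist())
print(via_tail["windSpeed"].tolist())
```

Here via_last carries windSpeed 20.0 forward from the earlier row, while via_tail keeps the NaN of the true last row. Which one is wanted depends on how missing forecasts should be treated.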

Upvotes: 1
