Reputation: 451
I have the following dataframe df
.
id start finish location
0 1 2015-12-14 16:44:00 2015-12-15 18:00:00 A
1 1 2015-12-15 18:00:00 2015-12-16 13:00:00 B
2 1 2015-12-16 13:00:00 2015-12-16 20:00:00 C
3 2 2015-12-10 13:15:00 2015-12-12 13:45:00 B
4 2 2015-12-12 13:45:00 2015-12-12 19:45:00 A
5 3 2015-12-15 07:45:00 2015-12-15 18:45:00 A
6 3 2015-12-15 18:45:00 2015-12-18 07:15:00 D
7 3 2015-12-18 07:15:00 2015-12-19 10:45:00 C
8 3 2015-12-19 10:45:00 2015-12-20 09:00:00 H
I wanted to find the the id_start_date
and id_end_date
for every id.
In the above example, there are start and finish dates for every line. I want to have two new columns id_start_date
and id_end_date
. In the id_start_date
column, I want to find the earliest most date in the start column specific to every id. This is easy. I can first sort the data based on id and start, then I can just pick the first start date in every id or I can do group-by based on id and later use aggregate function to find the minimum date in the start column. For the id_end_date
, I can do the same. I can group-by based on id and use aggregate function to find the maximum date in the finish column.
df1 = df.sort_values(['id','start'],ascending=True)
gp = df1.groupby('id')
gp_out = gp.agg({'start': {'mindate': np.min}, 'finish': {'maxdate': np.max}})
when I print gp_out
, It does show the correct dates but how would I write them back to the original dataframe df
. I expect the following:
id start finish location id_start_date id_end_date
0 1 2015-12-14 16:44:00 2015-12-15 18:00:00 A 2015-12-14 16:44:00 2015-12-16 20:00:00
1 1 2015-12-15 18:00:00 2015-12-16 13:00:00 B 2015-12-14 16:44:00 2015-12-16 20:00:00
2 1 2015-12-16 13:00:00 2015-12-16 20:00:00 C 2015-12-14 16:44:00 2015-12-16 20:00:00
3 2 2015-12-10 13:15:00 2015-12-12 13:45:00 B 2015-12-10 13:15:00 2015-12-12 19:45:00
4 2 2015-12-12 13:45:00 2015-12-12 19:45:00 A 2015-12-10 13:15:00 2015-12-12 19:45:00
5 3 2015-12-15 07:45:00 2015-12-15 18:45:00 A 2015-12-15 07:45:00 2015-12-20 09:00:00
6 3 2015-12-15 18:45:00 2015-12-18 07:15:00 D 2015-12-15 07:45:00 2015-12-20 09:00:00
7 3 2015-12-18 07:15:00 2015-12-19 10:45:00 C 2015-12-15 07:45:00 2015-12-20 09:00:00
8 3 2015-12-19 10:45:00 2015-12-20 09:00:00 H 2015-12-15 07:45:00 2015-12-20 09:00:00
How can i get the last two columns into the original dataframe df
?
Upvotes: 1
Views: 701
Reputation: 323266
Using transform
g=df.groupby('id')
df['id_start_date']=g['start'].transform('min')
df['id_end_date']=g['finish'].transform('max')
Upvotes: 1