find the earliest and latest dates between two columns

Question

I have the following dataframe df.

    id  start   finish  location
0   1   2015-12-14 16:44:00 2015-12-15 18:00:00 A
1   1   2015-12-15 18:00:00 2015-12-16 13:00:00 B
2   1   2015-12-16 13:00:00 2015-12-16 20:00:00 C
3   2   2015-12-10 13:15:00 2015-12-12 13:45:00 B
4   2   2015-12-12 13:45:00 2015-12-12 19:45:00 A
5   3   2015-12-15 07:45:00 2015-12-15 18:45:00 A
6   3   2015-12-15 18:45:00 2015-12-18 07:15:00 D
7   3   2015-12-18 07:15:00 2015-12-19 10:45:00 C
8   3   2015-12-19 10:45:00 2015-12-20 09:00:00 H

I wanted to find the the id_start_date and id_end_date for every id.

In the above example, there are start and finish dates for every line. I want to have two new columns id_start_date and id_end_date. In the id_start_date column, I want to find the earliest most date in the start column specific to every id. This is easy. I can first sort the data based on id and start, then I can just pick the first start date in every id or I can do group-by based on id and later use aggregate function to find the minimum date in the start column. For the id_end_date, I can do the same. I can group-by based on id and use aggregate function to find the maximum date in the finish column.

df1 = df.sort_values(['id','start'],ascending=True)
gp = df1.groupby('id')
gp_out = gp.agg({'start': {'mindate': np.min}, 'finish': {'maxdate': np.max}})

when I print gp_out, It does show the correct dates but how would I write them back to the original dataframe df. I expect the following:

id  start   finish  location id_start_date id_end_date
0   1   2015-12-14 16:44:00 2015-12-15 18:00:00 A 2015-12-14 16:44:00 2015-12-16 20:00:00
1   1   2015-12-15 18:00:00 2015-12-16 13:00:00 B 2015-12-14 16:44:00 2015-12-16 20:00:00
2   1   2015-12-16 13:00:00 2015-12-16 20:00:00 C 2015-12-14 16:44:00 2015-12-16 20:00:00
3   2   2015-12-10 13:15:00 2015-12-12 13:45:00 B 2015-12-10 13:15:00 2015-12-12 19:45:00
4   2   2015-12-12 13:45:00 2015-12-12 19:45:00 A 2015-12-10 13:15:00 2015-12-12 19:45:00
5   3   2015-12-15 07:45:00 2015-12-15 18:45:00 A 2015-12-15 07:45:00 2015-12-20 09:00:00
6   3   2015-12-15 18:45:00 2015-12-18 07:15:00 D 2015-12-15 07:45:00 2015-12-20 09:00:00
7   3   2015-12-18 07:15:00 2015-12-19 10:45:00 C 2015-12-15 07:45:00 2015-12-20 09:00:00
8   3   2015-12-19 10:45:00 2015-12-20 09:00:00 H 2015-12-15 07:45:00 2015-12-20 09:00:00

How can i get the last two columns into the original dataframe df?

BENY · Accepted Answer

Using transform

g=df.groupby('id')
df['id_start_date']=g['start'].transform('min')
df['id_end_date']=g['finish'].transform('max')

find the earliest and latest dates between two columns

Answers (1)

Related Questions