Reputation: 21961
I have the following dataframe:
fid date stage
test_fid 4/22/2019 a1
test_fid 4/23/2019 a1
test_fid 4/24/2019 a2
test_fid 4/25/2019 a2
test_fid 4/26/2019 a2
test_fid 4/27/2019 a3
test_fid 4/28/2019 a3
test_fid 4/29/2019 a3
test_fid1 4/30/2019 a1
test_fid1 5/1/2019 a1
test_fid1 5/2/2019 a1
test_fid1 5/3/2019 a1
test_fid1 5/4/2019 a2
test_fid1 5/5/2019 a2
test_fid1 5/6/2019 a2
test_fid1 5/7/2019 a2
test_fid1 5/8/2019 a3
test_fid1 5/9/2019 a3
test_fid1 5/10/2019 a3
I want to identify the date at which the stage column values starts and end e.g. test_fid has stage a1 from 4/22/2019 to 4/23/2019. The results should look like so:
fid stage start_date end_date
test_fid a1 4/22/2019 4/23/2019
test_fid a2 4/24/2019 4/26/2019
test_fid a3 4/27/2019 4/29/2019
test_fid1 a1 4/30/2019 5/3/2019
test_fid1 a2 5/4/2019 5/7/2019
test_fid1 a3 5/8/2019 5/10/2019
I tried this:
df['stage_change'] = df['stage'].diff()
df_filtered = df[df['stage_change'] != 0]
Upvotes: 3
Views: 505
Reputation: 476594
You probably forgot to parse your date
column to a date object, you can do that, like @pythonic said with:
df['date'] = pd.to_datetime(df['date'])
Probably the most robust way to do this is calculating the minimum and maximum of the date
for each group, like:
>>> df.groupby(['fid', 'stage'])['date'].agg({'start_date': 'min', 'end_date':'max'})
start_date end_date
fid stage
test_fid a1 4/22/2019 4/23/2019
a2 4/24/2019 4/26/2019
a3 4/27/2019 4/29/2019
test_fid1 a1 4/30/2019 5/3/2019
a2 5/4/2019 5/7/2019
a3 5/10/2019 5/9/2019
Or if you do not wish to use fid
and stage
as index, you can reset the index:
>>> df.groupby(['fid', 'stage'])['date'].agg({'start_date': 'min', 'end_date':'max'}).reset_index()
fid stage start_date end_date
0 test_fid a1 4/22/2019 4/23/2019
1 test_fid a2 4/24/2019 4/26/2019
2 test_fid a3 4/27/2019 4/29/2019
3 test_fid1 a1 4/30/2019 5/3/2019
4 test_fid1 a2 5/4/2019 5/7/2019
5 test_fid1 a3 5/10/2019 5/9/2019
Upvotes: 3
Reputation: 3224
Use sort_values
on date and groupby
. Then aggregate for the first and last date.
df.sort_values('date').groupby(['stage','fid']).agg({'date':['first', 'last']}).reset_index()
result
stage fid date
first last
0 a1 test_fid 2019-04-22 2019-04-23
1 a1 test_fid1 2019-04-30 2019-05-03
2 a2 test_fid 2019-04-24 2019-04-26
3 a2 test_fid1 2019-05-04 2019-05-07
4 a3 test_fid 2019-04-27 2019-04-29
5 a3 test_fid1 2019-05-08 2019-05-10
Edit: I first converted to datetime
df['date'] = pd.to_datetime(df['date'])
Upvotes: 3