user308827

Reputation: 21961

getting date of change based on column values in pandas dataframe

I have the following dataframe:

fid         date       stage
test_fid    4/22/2019   a1
test_fid    4/23/2019   a1
test_fid    4/24/2019   a2
test_fid    4/25/2019   a2
test_fid    4/26/2019   a2
test_fid    4/27/2019   a3
test_fid    4/28/2019   a3
test_fid    4/29/2019   a3
test_fid1   4/30/2019   a1
test_fid1   5/1/2019    a1
test_fid1   5/2/2019    a1
test_fid1   5/3/2019    a1
test_fid1   5/4/2019    a2
test_fid1   5/5/2019    a2
test_fid1   5/6/2019    a2
test_fid1   5/7/2019    a2
test_fid1   5/8/2019    a3
test_fid1   5/9/2019    a3
test_fid1   5/10/2019   a3

I want to identify the date at which the stage column values starts and end e.g. test_fid has stage a1 from 4/22/2019 to 4/23/2019. The results should look like so:

fid        stage    start_date  end_date
test_fid    a1  4/22/2019   4/23/2019
test_fid    a2  4/24/2019   4/26/2019
test_fid    a3  4/27/2019   4/29/2019
test_fid1   a1  4/30/2019   5/3/2019
test_fid1   a2  5/4/2019    5/7/2019
test_fid1   a3  5/8/2019    5/10/2019

I tried this:

df['stage_change'] = df['stage'].diff()
df_filtered = df[df['stage_change'] != 0]
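(For reference, `Series.diff` only works on numeric data and typically raises a `TypeError` on a string column such as `stage`. A sketch of the same change-detection idea using `shift` and `ne` instead, on a small hypothetical subset of the frame above:)

```python
import pandas as pd

# Hypothetical three-row subset of the frame in the question
df = pd.DataFrame({
    'fid':   ['test_fid', 'test_fid', 'test_fid'],
    'date':  pd.to_datetime(['4/22/2019', '4/23/2019', '4/24/2019']),
    'stage': ['a1', 'a1', 'a2'],
})

# diff() fails on strings; comparing each row to a shifted copy works instead
df['stage_change'] = df['stage'].ne(df['stage'].shift())
df_filtered = df[df['stage_change']]  # rows where a new stage run begins
```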

Upvotes: 3

Views: 505

Answers (2)

willeM_ Van Onsem

Reputation: 476594

You probably forgot to parse your date column into a datetime object; as @pythonic833 said, you can do that with:

df['date'] = pd.to_datetime(df['date'])

Probably the most robust way to do this is to calculate the minimum and maximum of the date for each group, like:

>>> df.groupby(['fid', 'stage'])['date'].agg(start_date='min', end_date='max')
                    start_date       end_date
fid       stage                              
test_fid  a1         4/22/2019      4/23/2019
          a2         4/24/2019      4/26/2019
          a3         4/27/2019      4/29/2019
test_fid1 a1         4/30/2019       5/3/2019
          a2          5/4/2019       5/7/2019
          a3          5/8/2019      5/10/2019

Or if you do not wish to use fid and stage as index, you can reset the index:

>>> df.groupby(['fid', 'stage'])['date'].agg(start_date='min', end_date='max').reset_index()
         fid stage     start_date       end_date
0   test_fid    a1      4/22/2019      4/23/2019
1   test_fid    a2      4/24/2019      4/26/2019
2   test_fid    a3      4/27/2019      4/29/2019
3  test_fid1    a1      4/30/2019       5/3/2019
4  test_fid1    a2       5/4/2019       5/7/2019
5  test_fid1    a3       5/8/2019      5/10/2019
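(A self-contained sketch of this approach, using the named-aggregation syntax available in pandas >= 0.25 and a sample reconstructed from the first eight rows of the question's frame:)

```python
import pandas as pd

# Reconstructed subset of the frame in the question
df = pd.DataFrame({
    'fid':   ['test_fid'] * 8,
    'date':  ['4/22/2019', '4/23/2019', '4/24/2019', '4/25/2019',
              '4/26/2019', '4/27/2019', '4/28/2019', '4/29/2019'],
    'stage': ['a1', 'a1', 'a2', 'a2', 'a2', 'a3', 'a3', 'a3'],
})
df['date'] = pd.to_datetime(df['date'])  # parse strings so min/max compare dates

# Named aggregation replaces the deprecated dict-renaming form of .agg
out = (df.groupby(['fid', 'stage'])['date']
         .agg(start_date='min', end_date='max')
         .reset_index())
```

Note that grouping on `['fid', 'stage']` assumes each stage forms one contiguous run per `fid`, as in the sample; if a stage recurred later for the same `fid`, min/max would silently merge the separate runs.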

Upvotes: 3

pythonic833

Reputation: 3224

Use sort_values on date, then groupby and aggregate the first and last date per group.

df.sort_values('date').groupby(['stage','fid']).agg({'date':['first', 'last']}).reset_index()

result

    stage   fid         date
                        first       last
0   a1      test_fid    2019-04-22  2019-04-23
1   a1      test_fid1   2019-04-30  2019-05-03
2   a2      test_fid    2019-04-24  2019-04-26
3   a2      test_fid1   2019-05-04  2019-05-07
4   a3      test_fid    2019-04-27  2019-04-29
5   a3      test_fid1   2019-05-08  2019-05-10

Edit: note that I first converted the date column to datetime:

df['date'] = pd.to_datetime(df['date'])
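(A runnable sketch of this answer on a reconstructed subset of the question's data; the `.agg` call produces a two-level column index, so the final rename to `start_date`/`end_date` is a hypothetical flattening step, not part of the answer itself:)

```python
import pandas as pd

# Reconstructed subset of the frame in the question
df = pd.DataFrame({
    'fid':   ['test_fid'] * 8,
    'date':  ['4/22/2019', '4/23/2019', '4/24/2019', '4/25/2019',
              '4/26/2019', '4/27/2019', '4/28/2019', '4/29/2019'],
    'stage': ['a1', 'a1', 'a2', 'a2', 'a2', 'a3', 'a3', 'a3'],
})
df['date'] = pd.to_datetime(df['date'])  # convert first so sorting is chronological

res = (df.sort_values('date')
         .groupby(['stage', 'fid'])
         .agg({'date': ['first', 'last']})
         .reset_index())

# Flatten the (date, first)/(date, last) MultiIndex columns for easier access
res.columns = ['stage', 'fid', 'start_date', 'end_date']
```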

Upvotes: 3
