Add new column with each row as previous group value from another column

Question

I have a dataframe:

data = pd.DataFrame({'start':['2020-08-01','2020-08-02','2020-08-03','2020-08-04','2020-08-05','2020-08-06','2020-08-07','2020-08-08'],
                     'end':['2020-08-03','2020-08-03','2020-08-06','2020-08-06','2020-08-06','2020-08-07','2020-08-08','2020-08-08'],
                     'score':[74, 81, 38, 49, 79, 17, 53, 69]})

that I need to add a new column in which each row is the end date from its previous group. The desired dataframe is:

         start         end    prev_end  score
0   2020-08-01  2020-08-03         NaN     74
1   2020-08-02  2020-08-03         NaN     81
2   2020-08-03  2020-08-06  2020-08-03     38
3   2020-08-04  2020-08-06  2020-08-03     49
4   2020-08-05  2020-08-06  2020-08-03     79
5   2020-08-06  2020-08-07  2020-08-06     17
6   2020-08-07  2020-08-08  2020-08-07     53
7   2020-08-08  2020-08-08  2020-08-07     69

so that I can mark the rows where start is greater than the prev_end date as below:

         start         end    prev_end  mark  score
0   2020-08-01  2020-08-03         NaN     0     74
1   2020-08-02  2020-08-03         NaN     0     81
2   2020-08-03  2020-08-06  2020-08-03     0     38
3   2020-08-04  2020-08-06  2020-08-03     1     49
4   2020-08-05  2020-08-06  2020-08-03     1     79
5   2020-08-06  2020-08-07  2020-08-06     0     17
6   2020-08-07  2020-08-08  2020-08-07     0     53
7   2020-08-08  2020-08-08  2020-08-07     1     69

Is there an easy solution to do this? Or some direct way without having to create prev_end? Many thanks!

jezrael · Accepted Answer

You can remove duplciates from end column and create dictionary by zip with omitted first value by indexing and pass to Series.map, for mark column compare by Series.gt and convert to 0,1 by Series.view:

data[['start','end']] = data[['start','end']].apply(pd.to_datetime)

s = data['end'].drop_duplicates()
data['prev_end'] = data['end'].map(dict(zip(s[1:], s)))

data['mark'] = data['start'].gt(data['prev_end']).view('i1')
print (data)
       start        end  score   prev_end  mark
0 2020-08-01 2020-08-03     74        NaT     0
1 2020-08-02 2020-08-03     81        NaT     0
2 2020-08-03 2020-08-06     38 2020-08-03     0
3 2020-08-04 2020-08-06     49 2020-08-03     1
4 2020-08-05 2020-08-06     79 2020-08-03     1
5 2020-08-06 2020-08-07     17 2020-08-06     0
6 2020-08-07 2020-08-08     53 2020-08-07     0
7 2020-08-08 2020-08-08     69 2020-08-07     1

If order of columns is important is possible use DataFrame.insert or change order by pass list of columns in expected order:

data[['start','end']] = data[['start','end']].apply(pd.to_datetime)

s = data['end'].drop_duplicates()
data.insert(2, 'prev_end', data['end'].map(dict(zip(s[1:], s))))
data.insert(3, 'mark', data['start'].gt(data['prev_end']).view('i1'))

print (data)
       start        end   prev_end  mark  score
0 2020-08-01 2020-08-03        NaT     0     74
1 2020-08-02 2020-08-03        NaT     0     81
2 2020-08-03 2020-08-06 2020-08-03     0     38
3 2020-08-04 2020-08-06 2020-08-03     1     49
4 2020-08-05 2020-08-06 2020-08-03     1     79
5 2020-08-06 2020-08-07 2020-08-06     0     17
6 2020-08-07 2020-08-08 2020-08-07     0     53
7 2020-08-08 2020-08-08 2020-08-07     1     69

data = data[['start', 'end', 'prev_end', 'mark', 'score']]

Add new column with each row as previous group value from another column

Answers (2)

Related Questions