Rock
Rock

Reputation: 2967

Add new column with each row as previous group value from another column

I have a dataframe:

data = pd.DataFrame({'start':['2020-08-01','2020-08-02','2020-08-03','2020-08-04','2020-08-05','2020-08-06','2020-08-07','2020-08-08'],
                     'end':['2020-08-03','2020-08-03','2020-08-06','2020-08-06','2020-08-06','2020-08-07','2020-08-08','2020-08-08'],
                     'score':[74, 81, 38, 49, 79, 17, 53, 69]})

that I need to add a new column in which each row is the end date from its previous group. The desired dataframe is:

         start         end    prev_end  score
0   2020-08-01  2020-08-03         NaN     74
1   2020-08-02  2020-08-03         NaN     81
2   2020-08-03  2020-08-06  2020-08-03     38
3   2020-08-04  2020-08-06  2020-08-03     49
4   2020-08-05  2020-08-06  2020-08-03     79
5   2020-08-06  2020-08-07  2020-08-06     17
6   2020-08-07  2020-08-08  2020-08-07     53
7   2020-08-08  2020-08-08  2020-08-07     69

so that I can mark the rows where start is greater than the prev_end date as below:

         start         end    prev_end  mark  score
0   2020-08-01  2020-08-03         NaN     0     74
1   2020-08-02  2020-08-03         NaN     0     81
2   2020-08-03  2020-08-06  2020-08-03     0     38
3   2020-08-04  2020-08-06  2020-08-03     1     49
4   2020-08-05  2020-08-06  2020-08-03     1     79
5   2020-08-06  2020-08-07  2020-08-06     0     17
6   2020-08-07  2020-08-08  2020-08-07     0     53
7   2020-08-08  2020-08-08  2020-08-07     1     69

Is there an easy solution to do this? Or some direct way without having to create prev_end? Many thanks!

Upvotes: 1

Views: 55

Answers (2)

Andy L.
Andy L.

Reputation: 25239

Another way using drop_duplicates, shift, reindex

data['prev_end'] = data.end.drop_duplicates().shift().reindex(data.index, method='pad')
data['mark'] = data.start.gt(data.prev_end).astype(int)

Out[39]:
       start        end  score   prev_end  mark
0 2020-08-01 2020-08-03     74        NaT     0
1 2020-08-02 2020-08-03     81        NaT     0
2 2020-08-03 2020-08-06     38 2020-08-03     0
3 2020-08-04 2020-08-06     49 2020-08-03     1
4 2020-08-05 2020-08-06     79 2020-08-03     1
5 2020-08-06 2020-08-07     17 2020-08-06     0
6 2020-08-07 2020-08-08     53 2020-08-07     0
7 2020-08-08 2020-08-08     69 2020-08-07     1

Upvotes: 2

jezrael
jezrael

Reputation: 862511

You can remove duplciates from end column and create dictionary by zip with omitted first value by indexing and pass to Series.map, for mark column compare by Series.gt and convert to 0,1 by Series.view:

data[['start','end']] = data[['start','end']].apply(pd.to_datetime)

s = data['end'].drop_duplicates()
data['prev_end'] = data['end'].map(dict(zip(s[1:], s)))

data['mark'] = data['start'].gt(data['prev_end']).view('i1')
print (data)
       start        end  score   prev_end  mark
0 2020-08-01 2020-08-03     74        NaT     0
1 2020-08-02 2020-08-03     81        NaT     0
2 2020-08-03 2020-08-06     38 2020-08-03     0
3 2020-08-04 2020-08-06     49 2020-08-03     1
4 2020-08-05 2020-08-06     79 2020-08-03     1
5 2020-08-06 2020-08-07     17 2020-08-06     0
6 2020-08-07 2020-08-08     53 2020-08-07     0
7 2020-08-08 2020-08-08     69 2020-08-07     1

If order of columns is important is possible use DataFrame.insert or change order by pass list of columns in expected order:

data[['start','end']] = data[['start','end']].apply(pd.to_datetime)

s = data['end'].drop_duplicates()
data.insert(2, 'prev_end', data['end'].map(dict(zip(s[1:], s))))
data.insert(3, 'mark', data['start'].gt(data['prev_end']).view('i1'))

print (data)
       start        end   prev_end  mark  score
0 2020-08-01 2020-08-03        NaT     0     74
1 2020-08-02 2020-08-03        NaT     0     81
2 2020-08-03 2020-08-06 2020-08-03     0     38
3 2020-08-04 2020-08-06 2020-08-03     1     49
4 2020-08-05 2020-08-06 2020-08-03     1     79
5 2020-08-06 2020-08-07 2020-08-06     0     17
6 2020-08-07 2020-08-08 2020-08-07     0     53
7 2020-08-08 2020-08-08 2020-08-07     1     69

data = data[['start', 'end', 'prev_end', 'mark', 'score']]

Upvotes: 4

Related Questions