Reputation: 2967
I have a dataframe:
data = pd.DataFrame({'start':['2020-08-01','2020-08-02','2020-08-03','2020-08-04','2020-08-05','2020-08-06','2020-08-07','2020-08-08'],
'end':['2020-08-03','2020-08-03','2020-08-06','2020-08-06','2020-08-06','2020-08-07','2020-08-08','2020-08-08'],
'score':[74, 81, 38, 49, 79, 17, 53, 69]})
that I need to add a new column in which each row is the end
date from its previous group. The desired dataframe is:
start end prev_end score
0 2020-08-01 2020-08-03 NaN 74
1 2020-08-02 2020-08-03 NaN 81
2 2020-08-03 2020-08-06 2020-08-03 38
3 2020-08-04 2020-08-06 2020-08-03 49
4 2020-08-05 2020-08-06 2020-08-03 79
5 2020-08-06 2020-08-07 2020-08-06 17
6 2020-08-07 2020-08-08 2020-08-07 53
7 2020-08-08 2020-08-08 2020-08-07 69
so that I can mark the rows where start
is greater than the prev_end
date as below:
start end prev_end mark score
0 2020-08-01 2020-08-03 NaN 0 74
1 2020-08-02 2020-08-03 NaN 0 81
2 2020-08-03 2020-08-06 2020-08-03 0 38
3 2020-08-04 2020-08-06 2020-08-03 1 49
4 2020-08-05 2020-08-06 2020-08-03 1 79
5 2020-08-06 2020-08-07 2020-08-06 0 17
6 2020-08-07 2020-08-08 2020-08-07 0 53
7 2020-08-08 2020-08-08 2020-08-07 1 69
Is there an easy solution to do this? Or some direct way without having to create prev_end
? Many thanks!
Upvotes: 1
Views: 55
Reputation: 25239
Another way using drop_duplicates
, shift
, reindex
data['prev_end'] = data.end.drop_duplicates().shift().reindex(data.index, method='pad')
data['mark'] = data.start.gt(data.prev_end).astype(int)
Out[39]:
start end score prev_end mark
0 2020-08-01 2020-08-03 74 NaT 0
1 2020-08-02 2020-08-03 81 NaT 0
2 2020-08-03 2020-08-06 38 2020-08-03 0
3 2020-08-04 2020-08-06 49 2020-08-03 1
4 2020-08-05 2020-08-06 79 2020-08-03 1
5 2020-08-06 2020-08-07 17 2020-08-06 0
6 2020-08-07 2020-08-08 53 2020-08-07 0
7 2020-08-08 2020-08-08 69 2020-08-07 1
Upvotes: 2
Reputation: 862511
You can remove duplciates from end
column and create dictionary by zip
with omitted first value by indexing and pass to Series.map
, for mark
column compare by Series.gt
and convert to 0,1
by Series.view
:
data[['start','end']] = data[['start','end']].apply(pd.to_datetime)
s = data['end'].drop_duplicates()
data['prev_end'] = data['end'].map(dict(zip(s[1:], s)))
data['mark'] = data['start'].gt(data['prev_end']).view('i1')
print (data)
start end score prev_end mark
0 2020-08-01 2020-08-03 74 NaT 0
1 2020-08-02 2020-08-03 81 NaT 0
2 2020-08-03 2020-08-06 38 2020-08-03 0
3 2020-08-04 2020-08-06 49 2020-08-03 1
4 2020-08-05 2020-08-06 79 2020-08-03 1
5 2020-08-06 2020-08-07 17 2020-08-06 0
6 2020-08-07 2020-08-08 53 2020-08-07 0
7 2020-08-08 2020-08-08 69 2020-08-07 1
If order of columns is important is possible use DataFrame.insert
or change order by pass list of columns in expected order:
data[['start','end']] = data[['start','end']].apply(pd.to_datetime)
s = data['end'].drop_duplicates()
data.insert(2, 'prev_end', data['end'].map(dict(zip(s[1:], s))))
data.insert(3, 'mark', data['start'].gt(data['prev_end']).view('i1'))
print (data)
start end prev_end mark score
0 2020-08-01 2020-08-03 NaT 0 74
1 2020-08-02 2020-08-03 NaT 0 81
2 2020-08-03 2020-08-06 2020-08-03 0 38
3 2020-08-04 2020-08-06 2020-08-03 1 49
4 2020-08-05 2020-08-06 2020-08-03 1 79
5 2020-08-06 2020-08-07 2020-08-06 0 17
6 2020-08-07 2020-08-08 2020-08-07 0 53
7 2020-08-08 2020-08-08 2020-08-07 1 69
data = data[['start', 'end', 'prev_end', 'mark', 'score']]
Upvotes: 4