Reputation: 89
My data looks like this:
   Journey ID  Visit ID      Date  Conversion
0           1         1  20200101           1
1           1         2  20200102           1
2           1         3  20200103           0
3           2         4  20200104           0
4           2         5  20200105           1
5           3         6  20200106           1
6           3         7  20200107           0
My goal is to remove, within each Journey ID, all rows that appear after the first conversion, regardless of their content. In this example the rows with index 1, 2 and 6 should be removed.
I've been stuck on this for a while. I've tried shifting columns and masking conditions, but it gets too complicated and doesn't really work.
I need a safe, scalable solution that works on data with over 2 million rows and is easy to validate.
You can reproduce the example data set with this:
import pandas as pd

d = {'Journey ID': [1, 1, 1, 2, 2, 3, 3],
     'Visit ID': [1, 2, 3, 4, 5, 6, 7],
     'Date': [20200101, 20200102, 20200103, 20200104, 20200105, 20200106, 20200107],
     'Conversion': [1, 1, 0, 0, 1, 1, 0]}
data = pd.DataFrame(data=d)
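For anyone validating a candidate solution, the target frame can be built directly from the row indices named above (rows 1, 2 and 6 dropped); this is just the expected output, not a solution:

```python
import pandas as pd

# Example data from the question
d = {'Journey ID': [1, 1, 1, 2, 2, 3, 3],
     'Visit ID': [1, 2, 3, 4, 5, 6, 7],
     'Date': [20200101, 20200102, 20200103, 20200104, 20200105, 20200106, 20200107],
     'Conversion': [1, 1, 0, 0, 1, 1, 0]}
data = pd.DataFrame(data=d)

# Rows 1, 2 and 6 come after the first conversion of their journey
expected = data.drop(index=[1, 2, 6])
print(expected)
```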
Upvotes: 3
Views: 55
Reputation: 75080
IIUC, you can try cumsum with groupby:
out = data[data['Conversion'] == data.groupby('Journey ID')['Conversion'].cumsum()]
   Journey ID  Visit ID      Date  Conversion
0           1         1  20200101           1
3           2         4  20200104           0
4           2         5  20200105           1
5           3         6  20200106           1
Here is the logic visualized:
   Journey ID  Visit ID      Date  Conversion  cumsum_Journey  delete
0           1         1  20200101           1               1   False
1           1         2  20200102           1               2    True
2           1         3  20200103           0               2    True
3           2         4  20200104           0               0   False
4           2         5  20200105           1               1   False
5           3         6  20200106           1               1   False
6           3         7  20200107           0               1    True
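The helper columns in the table can be reproduced with a short sketch (the names cumsum_Journey and delete are just the labels used in the table, not part of any API):

```python
import pandas as pd

# Rebuild the example data from the question
d = {'Journey ID': [1, 1, 1, 2, 2, 3, 3],
     'Visit ID': [1, 2, 3, 4, 5, 6, 7],
     'Date': [20200101, 20200102, 20200103, 20200104, 20200105, 20200106, 20200107],
     'Conversion': [1, 1, 0, 0, 1, 1, 0]}
data = pd.DataFrame(data=d)

# Running total of conversions within each journey
data['cumsum_Journey'] = data.groupby('Journey ID')['Conversion'].cumsum()

# A row is deleted when a conversion occurred strictly before it in the
# journey, i.e. when the running total exceeds the row's own Conversion value
data['delete'] = data['Conversion'] != data['cumsum_Journey']

out = data.loc[~data['delete'], ['Journey ID', 'Visit ID', 'Date', 'Conversion']]
print(out)
```

Since the `delete` mask is a plain column, it is easy to spot-check a sample of journeys before dropping anything, which helps on the 2-million-row frame.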
Upvotes: 2