Reputation: 89
My data looks like this:
   Journey ID  Visit ID      Date  Conversion
0           1         1  20200101           1
1           1         2  20200102           1
2           1         3  20200103           0
3           2         4  20200104           0
4           2         5  20200105           1
5           3         6  20200106           1
6           3         7  20200107           0
My goal is to remove, within each Journey ID, all rows that appear after the first conversion, regardless of their content. In this example the rows with index 1, 2 and 6 should be removed.
I've been stuck on this for a while. I've tried shifting columns and masking conditions, but it gets too complicated and doesn't really work.
I need a safe, scalable solution that works on data with over 2 million rows and is easy to validate.
You can reproduce the example data set with this:
import pandas as pd

d = {'Journey ID': [1, 1, 1, 2, 2, 3, 3],
     'Visit ID': [1, 2, 3, 4, 5, 6, 7],
     'Date': [20200101, 20200102, 20200103, 20200104, 20200105, 20200106, 20200107],
     'Conversion': [1, 1, 0, 0, 1, 1, 0]}
data = pd.DataFrame(data=d)
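For anyone validating a candidate solution, the target frame can be built directly from the row indices named above (rows 1, 2 and 6 dropped); this is just the expected output, not a solution:

```python
import pandas as pd

# Example data from the question
d = {'Journey ID': [1, 1, 1, 2, 2, 3, 3],
     'Visit ID': [1, 2, 3, 4, 5, 6, 7],
     'Date': [20200101, 20200102, 20200103, 20200104, 20200105, 20200106, 20200107],
     'Conversion': [1, 1, 0, 0, 1, 1, 0]}
data = pd.DataFrame(data=d)

# Rows 1, 2 and 6 come after the first conversion of their journey
expected = data.drop(index=[1, 2, 6])
print(expected)
```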
Upvotes: 3
Views: 55
Reputation: 75080
IIUC, you can try cumsum with groupby:
out = data[data['Conversion'] == data.groupby('Journey ID')['Conversion'].cumsum()]
   Journey ID  Visit ID      Date  Conversion
0           1         1  20200101           1
3           2         4  20200104           0
4           2         5  20200105           1
5           3         6  20200106           1
Here is the logic visualized:
   Journey ID  Visit ID      Date  Conversion  cumsum_Journey  delete
0           1         1  20200101           1               1   False
1           1         2  20200102           1               2    True
2           1         3  20200103           0               2    True
3           2         4  20200104           0               0   False
4           2         5  20200105           1               1   False
5           3         6  20200106           1               1   False
6           3         7  20200107           0               1    True
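The helper columns in the table can be reproduced with a short sketch (the names cumsum_Journey and delete are just the labels used in the table, not part of any API):

```python
import pandas as pd

# Rebuild the example data from the question
d = {'Journey ID': [1, 1, 1, 2, 2, 3, 3],
     'Visit ID': [1, 2, 3, 4, 5, 6, 7],
     'Date': [20200101, 20200102, 20200103, 20200104, 20200105, 20200106, 20200107],
     'Conversion': [1, 1, 0, 0, 1, 1, 0]}
data = pd.DataFrame(data=d)

# Running total of conversions within each journey
data['cumsum_Journey'] = data.groupby('Journey ID')['Conversion'].cumsum()

# A row is deleted when a conversion occurred strictly before it in the
# journey, i.e. when the running total exceeds the row's own Conversion value
data['delete'] = data['Conversion'] != data['cumsum_Journey']

out = data.loc[~data['delete'], ['Journey ID', 'Visit ID', 'Date', 'Conversion']]
print(out)
```

Since the `delete` mask is a plain column, it is easy to spot-check a sample of journeys before dropping anything, which helps on the 2-million-row frame.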
Upvotes: 2