audioslaver756

Reputation: 65

Delete all rows after the first appearance of a condition in a pandas data frame

Working with the following example:

df = pd.DataFrame({"Person":[1,1,2,2,2,2,3,3,3], "Bank":["OPEN","OPEN","OPEN","OPEN","CLOSED","OPEN","OPEN","CLOSED","CLOSED"]})

   Person   Bank
0       1   OPEN
1       1   OPEN
2       2   OPEN
3       2   OPEN
4       2   CLOSED
5       2   OPEN
6       3   OPEN
7       3   CLOSED
8       3   CLOSED

I'd like to produce an output that keeps all rows up to and including the first occurrence of CLOSED within each Person group. So it should look like:

   Person   Bank
0       1   OPEN
1       1   OPEN
2       2   OPEN
3       2   OPEN
4       2   CLOSED
6       3   OPEN
7       3   CLOSED

I was able to use this question to build an output that's close:

mask = (df['Bank']
    .where(df['Bank'] == 'OPEN')
    .groupby(df['Person'])
    .ffill(limit=1)
)
df[mask.notnull()]

# The above produces this
   Person   Bank
0       1   OPEN
1       1   OPEN
2       2   OPEN
3       2   OPEN
4       2   CLOSED
5       2   OPEN
6       3   OPEN
7       3   CLOSED

So my current code doesn't handle the case of an account moving from CLOSED back to OPEN, as the intermediate mask below shows. Is there a good way to do this that isn't extremely slow?
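Every OPEN row is non-null by construction, so the where/ffill step can only remove CLOSED rows beyond the first; a re-opened row always survives:

print(mask.tolist())
# ['OPEN', 'OPEN', 'OPEN', 'OPEN', 'OPEN', 'OPEN', 'OPEN', 'OPEN', nan]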

Upvotes: 3

Views: 259

Answers (3)

Abhi_J

Reputation: 2129

This might not be the most elegant solution, but it seems to work for my test scenarios:

Code:

import pandas as pd

df1 = pd.DataFrame({"Person":[1,1,2,2,2,2,2,3,3,3], "Bank":["OPEN","OPEN","OPEN","OPEN","CLOSED","OPEN","CLOSED","OPEN","CLOSED","CLOSED"]})
df2 = pd.DataFrame({"Person":[1,1,2,2,2,2,2,3,3,3], "Bank":["CLOSED","OPEN","OPEN","OPEN","CLOSED","OPEN","CLOSED","OPEN","CLOSED","CLOSED"]})


def process_data(df):
    idx_arr = []
    for p in df['Person'].unique():
        mask1 = list(df['Person'] == p)
        mask2 = list((df['Person'] == p) & (df['Bank'] == "CLOSED"))
        start = mask1.index(True)                       # first row of this person
        if any(mask2):
            end = mask2.index(True) + 1                 # up to and including the first CLOSED
        else:
            end = len(mask1) - mask1[::-1].index(True)  # one past the person's last row
        idx_arr += range(start, end)

    return df.iloc[idx_arr]

print(process_data(df1))
print(process_data(df2))

Input 1:

   Person    Bank
0       1    OPEN
1       1    OPEN
2       2    OPEN
3       2    OPEN
4       2  CLOSED
5       2    OPEN
6       2  CLOSED
7       3    OPEN
8       3  CLOSED
9       3  CLOSED

Output 1:

   Person    Bank
0       1    OPEN
1       1    OPEN
2       2    OPEN
3       2    OPEN
4       2  CLOSED
7       3    OPEN
8       3  CLOSED

Input 2:

   Person    Bank
0       1  CLOSED
1       1    OPEN
2       2    OPEN
3       2    OPEN
4       2  CLOSED
5       2    OPEN
6       2  CLOSED
7       3    OPEN
8       3  CLOSED
9       3  CLOSED

Output 2:

   Person    Bank
0       1  CLOSED
2       2    OPEN
3       2    OPEN
4       2  CLOSED
7       3    OPEN
8       3  CLOSED

Upvotes: 0

AmineBTG

Reputation: 697

df = pd.DataFrame({"Person":[1,1,2,2,2,2,3,3,3], "Bank":["OPEN","OPEN","OPEN","OPEN","CLOSED","OPEN","OPEN","CLOSED","CLOSED"]})

# Collect the labels of every row that follows a CLOSED row for the same
# Person, then drop them in one pass (dropping inside the loop shifts
# positions, skips rows, and forces a bare try/except).
to_drop = []
closed = set()
for idx, row in df.iterrows():
    if row["Person"] in closed:
        to_drop.append(idx)
    elif row["Bank"] == "CLOSED":
        closed.add(row["Person"])
df = df.drop(to_drop).reset_index(drop=True)
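Applied to the example frame, the corrected loop keeps the expected rows (indices are renumbered by reset_index):

print(df)

   Person    Bank
0       1    OPEN
1       1    OPEN
2       2    OPEN
3       2    OPEN
4       2  CLOSED
5       3    OPEN
6       3  CLOSED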

Upvotes: 0

ALollz

Reputation: 59529

You can create the mask with a groupby. It requires two operations (cummax + shift), so the straightforward method is a slower apply, but with many groups you'll see better performance from two separate groupby calls using the built-in operations:

# cummax flags every row at or after the first CLOSED within each group;
# shift pushes the flag down one row so the first CLOSED itself survives.
m = (df['Bank'].eq('CLOSED')
       .groupby(df['Person'])
       .apply(lambda x: ~x.cummax().shift().fillna(False)))

# or, avoiding apply entirely:
m = ~(df['Bank'].eq('CLOSED')
        .groupby(df['Person']).cummax()
        .groupby(df['Person']).shift()
        .fillna(False))

df[m]
   Person    Bank
0       1    OPEN
1       1    OPEN
2       2    OPEN
3       2    OPEN
4       2  CLOSED
6       3    OPEN
7       3  CLOSED

Upvotes: 2
