Reputation: 65
Working with the following example:
df = pd.DataFrame({"Person":[1,1,2,2,2,2,3,3,3], "Bank":["OPEN","OPEN","OPEN","OPEN","CLOSED","OPEN","OPEN","CLOSED","CLOSED"]})
Person Bank
0 1 OPEN
1 1 OPEN
2 2 OPEN
3 2 OPEN
4 2 CLOSED
5 2 OPEN
6 3 OPEN
7 3 CLOSED
8 3 CLOSED
I'd like to produce an output that keeps all rows up to and including the first occurrence of CLOSED
with each Person
group. So it should look like:
Person Bank
0 1 OPEN
1 1 OPEN
2 2 OPEN
3 2 OPEN
4 2 CLOSED
6 3 OPEN
7 3 CLOSED
I was able to use this question to build an output that's close:
mask = (df['Bank']
.where(df['Bank'] == 'OPEN')
.groupby(df['Person'])
.ffill(limit=1)
)
df[mask.notnull()]
# The above produces this
Person Bank
0 1 OPEN
1 1 OPEN
2 2 OPEN
3 2 OPEN
4 2 CLOSED
5 2 OPEN
6 3 OPEN
7 3 CLOSED
So my current code doesn't handle the case of something moving from CLOSED
back to OPEN
. Is there a good way to do this that isn't extremely slow?
Upvotes: 3
Views: 259
Reputation: 2129
This might not be the most elegant solution but seems to work for my test scenarios:
Code:
import pandas as pd
df1 = pd.DataFrame({"Person":[1,1,2,2,2,2,2,3,3,3], "Bank":["OPEN","OPEN","OPEN","OPEN","CLOSED","OPEN","CLOSED","OPEN","CLOSED","CLOSED"]})
df2 = pd.DataFrame({"Person":[1,1,2,2,2,2,2,3,3,3], "Bank":["CLOSED","OPEN","OPEN","OPEN","CLOSED","OPEN","CLOSED","OPEN","CLOSED","CLOSED"]})
def process_data(df):
persons = df['Person'].unique()
idx_arr = []
for p in persons:
mask1 = df['Person'] == p
mask2 = (df['Bank'] == "CLOSED") & mask1
idx_arr += range(list(mask1).index(True), 1+list(mask1&mask2).index(True) if any(mask2) else
len(mask1) - 1 - list(mask1)[::-1].index(True)+1)
return df.iloc[idx_arr]
print(process_data(df1))
print(process_data(df2))
Input 1:
Person Bank
0 1 OPEN
1 1 OPEN
2 2 OPEN
3 2 OPEN
4 2 CLOSED
5 2 OPEN
6 2 CLOSED
7 3 OPEN
8 3 CLOSED
9 3 CLOSED
Output 1:
Person Bank
0 1 OPEN
1 1 OPEN
2 2 OPEN
3 2 OPEN
4 2 CLOSED
7 3 OPEN
8 3 CLOSED
Input 2:
Person Bank
0 1 CLOSED
1 1 OPEN
2 2 OPEN
3 2 OPEN
4 2 CLOSED
5 2 OPEN
6 2 CLOSED
7 3 OPEN
8 3 CLOSED
9 3 CLOSED
Output 2:
Person Bank
0 1 CLOSED
2 2 OPEN
3 2 OPEN
4 2 CLOSED
7 3 OPEN
8 3 CLOSED
Upvotes: 0
Reputation: 697
df = pd.DataFrame({"Person":[1,1,2,2,2,2,3,3,3], "Bank":["OPEN","OPEN","OPEN","OPEN","CLOSED","OPEN","OPEN","CLOSED","CLOSED"]})
for i in range(df.shape[0]):
try:
if df.iloc[i,:]["Bank"] == "CLOSED" and df.iloc[i+1,:]["Person"] == df.iloc[i,:]["Person"]:
df.drop(df.index[i+1], axis=0, inplace=True).reset_index(True)
except:
pass
Upvotes: 0
Reputation: 59529
You can create the mask with a groupby
. It requires 2 operations cummax
+ shift
so the straight forward method is to use a slower apply, but for many groups you'll see better performance from 2 separate groupby
calls using the built-in operations
m = (df['Bank'].eq('CLOSED')
.groupby(df['Person'])
.apply(lambda x: ~x.cummax().shift().fillna(False)))
# or
m = ~(df['Bank'].eq('CLOSED')
.groupby(df['Person']).cummax()
.groupby(df['Person']).shift()
.fillna(False))
df[m]
Person Bank
0 1 OPEN
1 1 OPEN
2 2 OPEN
3 2 OPEN
4 2 CLOSED
6 3 OPEN
7 3 CLOSED
Upvotes: 2