Updating values for a subset of a subset of a pandas dataframe too slow for large data set

Question

Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient.

Issue I'm Having: My code works on a small dataset, but the actual dataset is about 5GB and 13M rows. The code has been running for several days and still hasn't finished. For background, my code is in a Jupyter notebook running on a standard work PC.

Sample Data

    import pandas as pd

    df = pd.DataFrame({"PatientAccountNumber":[113,113,113,113,225,225,225,225,225,225,225],
                       "TransactionCode":['50','50','77','60','22','77','25','77','25','77','77'],
                       "Bucket":['Charity','Charity','Bad Debt','3rd Party','Self Pay','Bad Debt',
                                 'Charity','Bad Debt','Charity','Bad Debt','Bad Debt']})

    print(df)

Sample Dataframe

    PatientAccountNumber TransactionCode     Bucket
0                    113              50    Charity
1                    113              50    Charity
2                    113              77   Bad Debt
3                    113              60  3rd Party
4                    225              22   Self Pay
5                    225              77   Bad Debt
6                    225              25    Charity
7                    225              77   Bad Debt
8                    225              25    Charity
9                    225              77   Bad Debt
10                   225              77   Bad Debt

Current Solution (works, but too slow)

for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    df.drop(df[mask].index[1:],inplace=True)

print(df)

Desired Result (Each patient should have a maximum of one Bad Debt transaction)

   PatientAccountNumber TransactionCode     Bucket
0                   113              50    Charity
1                   113              50    Charity
2                   113              77   Bad Debt
3                   113              60  3rd Party
4                   225              22   Self Pay
5                   225              77   Bad Debt
6                   225              25    Charity
8                   225              25    Charity

Alternate Solution

for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    mask = mask & (mask.cumsum() > 1)
    df.loc[mask, 'Bucket'] = 'DELETE'

df = df[df['Bucket'] != 'DELETE']

Attempted using Dask

I thought Dask might be able to help, but I got the following errors:

Using Dask on first solution - "NotImplementedError: Series getitem in only supported for other series objects with matching partition structure"
Using Dask on second solution - "TypeError: '_LocIndexer' object does not support item assignment"

anky · Accepted Answer

You can solve this without a loop by using df.duplicated on both PatientAccountNumber and Bucket, then dropping the rows that are duplicates and whose Bucket is Bad Debt:

df[~(df.duplicated(['PatientAccountNumber','Bucket']) & df['Bucket'].eq("Bad Debt"))]

   PatientAccountNumber TransactionCode     Bucket
0                   113              50    Charity
1                   113              50    Charity
2                   113              77   Bad Debt
3                   113              60  3rd Party
4                   225              22   Self Pay
5                   225              77   Bad Debt
6                   225              25    Charity
8                   225              25    Charity
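
To make the one-liner easier to follow, here is a minimal sketch that splits it into named steps on the sample data from the question. The variable names (is_repeat, is_bad_debt, bad_debt_rank) are only illustrative. df.duplicated keeps the first occurrence by default, so the first Bad Debt row per account survives. The second half shows a groupby/cumcount variant that should give the same rows; it is included as an assumption-level alternative, not as part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "PatientAccountNumber": [113, 113, 113, 113, 225, 225, 225, 225, 225, 225, 225],
    "TransactionCode": ['50', '50', '77', '60', '22', '77', '25', '77', '25', '77', '77'],
    "Bucket": ['Charity', 'Charity', 'Bad Debt', '3rd Party', 'Self Pay', 'Bad Debt',
               'Charity', 'Bad Debt', 'Charity', 'Bad Debt', 'Bad Debt'],
})

# True for every row whose (account, bucket) pair already appeared in an earlier row.
is_repeat = df.duplicated(["PatientAccountNumber", "Bucket"])

# True for every Bad Debt row.
is_bad_debt = df["Bucket"].eq("Bad Debt")

# Drop only the rows that are both: repeated Bad Debt rows for the same account.
result = df[~(is_repeat & is_bad_debt)]

# Illustrative alternative: number the Bad Debt rows within each account
# (0, 1, 2, ...) and drop every row numbered above 0.
bad_debt_rank = df[is_bad_debt].groupby("PatientAccountNumber").cumcount()
result_alt = df.drop(bad_debt_rank[bad_debt_rank > 0].index)

print(result)
```

Both variants build the mask for the whole frame at once, whereas the loop in the question rescans every row once per account, which is likely why it does not finish on 13M rows.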