Reputation: 33
Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient.
Issue I'm Having: My code works on a small dataset, but the actual data set is about 5GB and 13M rows. The code has been running for several days now and still hasn't finished. For background, my code is in a Jupyter notebook running on a standard work PC.
import pandas as pd
df = pd.DataFrame({"PatientAccountNumber":[113,113,113,113,225,225,225,225,225,225,225],
"TransactionCode":['50','50','77','60','22','77','25','77','25','77','77'],
"Bucket":['Charity','Charity','Bad Debt','3rd Party','Self Pay','Bad Debt',
'Charity','Bad Debt','Charity','Bad Debt','Bad Debt']})
print(df)
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
7 225 77 Bad Debt
8 225 25 Charity
9 225 77 Bad Debt
10 225 77 Bad Debt
for account in df['PatientAccountNumber'].unique():
mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
df.drop(df[mask].index[1:],inplace=True)
print(df)
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
for account in df['PatientAccountNumber'].unique():
mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
mask = mask & (mask.cumsum() > 1)
df.loc[mask, 'Bucket'] = 'DELETE'
df = df[df['Bucket'] != 'DELETE]
I thought maybe Dask would be able to help me out, but I got the following error codes:
Upvotes: 3
Views: 130
Reputation: 75100
You can solve this using df.duplicated on both accountNumber and Bucket and checking if Bucket is Bad Debt:
df[~(df.duplicated(['PatientAccountNumber','Bucket']) & df['Bucket'].eq("Bad Debt"))]
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Upvotes: 3
Reputation: 120469
Create a boolean mask without loop:
mask = df[df['Bucket'].eq('Bad Debt')].duplicated('PatientAccountNumber')
df.drop(mask[mask].index, inplace=True)
>>> df
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Upvotes: 2