Reputation: 2130
There must be a better way to do this, please help me
Here's an extract of some of the data I have to clean, which has several kind of "duplicate" rows (not all the row is duplicated):
df =
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | NaN | 34200 |
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
200 | DEF | Write Off | 611 | NaN |
300 | GHI | Paid | NaN | 247112 |
300 | GHI | Paid | 799 | NaN |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
So I have the following type of duplicate cases:
So, obviously what I want is to have a dataframe without the duplicates like:
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
300 | GHI | Paid | 799 | 247112 |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
So, how I have solved this with:
# Get the repeated keys:
rep = df['LoanID'].value_counts()
rep = rep[rep > 2]
# Now we get the valid number (we overwrite the NaNs)
for i in rep.keys():
df.loc[df['LoanID'] == i, 'CreditScore'] = df[df['LoanID'] == i]['CreditScore'].max()
df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()
# Drop duplicates
df.drop_duplicates(inplace=True)
This works, does exactly what I need, the problem is that this dataframe is several 100k records, so this method takes "forever", there must be some way to do it better, right?
Upvotes: 3
Views: 1051
Reputation: 57105
Grouping by loan id, filling in missing values both above and below, and removing duplicates seems to work:
df.groupby('LoanID').apply(lambda x: \
fillna(method='ffill').\
fillna(method='bfill').\
drop_duplicates()).\
reset_index(drop=True).\
set_index('LoanID')
# CustomerID LoanStatus CreditScore AnnualIncome
#LoanID
#100 ABC Paid 724.0 34200.0
#200 DEF Write Off 611.0 9800.0
#300 GHI Paid 799.0 247112.0
#400 JKL Paid NaN NaN
#500 MNO Paid 444.0 NaN
Upvotes: 3