luisfer

Reputation: 2130

Python/Pandas: How to consolidate repeated rows with NaN in different columns?

There must be a better way to do this; please help me.

Here's an extract of some of the data I have to clean, which has several kinds of "duplicate" rows (not every column in the row is duplicated):

df =

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         NaN |        34200 |
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   200 | DEF        | Write Off  |         611 |          NaN |
   300 | GHI        | Paid       |         NaN |       247112 |
   300 | GHI        | Paid       |         799 |          NaN |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |

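(For reference, the sample above can be rebuilt with something like the following; column names are as in the table, and the trailing "..." columns are omitted:)

    import numpy as np
    import pandas as pd

    # Reconstruction of the sample frame shown above
    df = pd.DataFrame({
        'LoanID':       [100, 100, 200, 200, 300, 300, 400, 500],
        'CustomerID':   ['ABC', 'ABC', 'DEF', 'DEF', 'GHI', 'GHI', 'JKL', 'MNO'],
        'LoanStatus':   ['Paid', 'Paid', 'Write Off', 'Write Off',
                         'Paid', 'Paid', 'Paid', 'Paid'],
        'CreditScore':  [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
        'AnnualIncome': [34200, 34200, 9800, np.nan, 247112, np.nan, np.nan, np.nan],
    })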
So I have the following types of duplicate cases:

  1. A NaN and a valid value in column CreditScore (LoanID = 100)
  2. A NaN and a valid value in column AnnualIncome (LoanID = 200)
  3. A NaN and a valid value in both CreditScore and AnnualIncome (LoanID = 300)
  4. LoanIDs 400 and 500 are "normal" cases (no duplicates)

What I want, obviously, is a dataframe without the duplicates, like:

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   300 | GHI        | Paid       |         799 |       247112 |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |

Here is how I have solved it so far:

# Get the LoanIDs that appear more than once:
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]

# Overwrite the NaNs with the valid value for each repeated LoanID
for i in rep.index:
    df.loc[df['LoanID'] == i, 'CreditScore']  = df.loc[df['LoanID'] == i, 'CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df.loc[df['LoanID'] == i, 'AnnualIncome'].max()

# Drop the rows that are now exact duplicates
df.drop_duplicates(inplace=True)

This works and does exactly what I need. The problem is that the dataframe has several hundred thousand records, so this method takes "forever". There must be a better way to do it, right?
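(In case anyone wants to benchmark against a frame of realistic size, here is one hypothetical way to tile the 8-row sample up to ~100k rows, shifting LoanID so each copy gets its own set of keys:)

    # Hypothetical benchmark input: 12,500 copies of the 8-row sample
    big = pd.concat(
        [df.assign(LoanID=df['LoanID'] + 1000 * i) for i in range(12500)],
        ignore_index=True)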

Upvotes: 3

Views: 1051

Answers (1)

DYZ

Reputation: 57105

Grouping by LoanID, forward- and backward-filling the missing values within each group, and removing duplicates seems to work:

df.groupby('LoanID').apply(lambda x: x.fillna(method='ffill')
                                      .fillna(method='bfill')
                                      .drop_duplicates())\
                    .reset_index(drop=True)\
                    .set_index('LoanID')
#       CustomerID LoanStatus  CreditScore  AnnualIncome  
#LoanID                                                             
#100           ABC       Paid        724.0       34200.0       
#200           DEF  Write Off        611.0        9800.0       
#300           GHI       Paid        799.0      247112.0       
#400           JKL       Paid          NaN           NaN       
#500           MNO       Paid        444.0           NaN       
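
As a side note: if the rule is simply "take the first non-null value per column within each LoanID group", then GroupBy.first(), which skips NaNs, gives the same result with much less machinery. A minimal sketch, assuming that rule holds for every non-key column:

    # First non-null value per column within each LoanID group;
    # as_index=False keeps LoanID as a regular column
    out = df.groupby('LoanID', as_index=False).first()

This avoids both the Python-level loop and the per-group apply, so it should scale much better to hundreds of thousands of rows.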

Upvotes: 3
