ERJAN
ERJAN

Reputation: 24500

how to fill NaNs based on repeating id?

I have one basic df - a monthly report of loans , it's 1700 contract ids of loans given per some date(monthly) e.g. report per 01.01.2019:

contract_id    loan_sum   expiry_days_01_01_2019
  1            1000       20
  2            2300       3
  3            500        17
  4            1400       6
  5            890        5

another for 01.02.2019

 contract_id loan_sum   expiry_days_01_02_2019
      1      7000       20
      2      3000       39
      3      500        130
      4      9400       89
      5      909        7

etc.

I have 5 reports with same number of ids. I have combined them under another.

Now i have many nan values for expiry-date column, like in a matrix with diagonal filled properly with ones.

 id  contract_id     loan_sum   expiry_days_01_01_2019 expiry_01_02_2019
   0           1      7000       20                      nan
   1           2      3000       39                      nan
   2           3      500        130                     nan
   3           4      9400       89                      nan
   4           5      909        7                       nan
   5           1      7000       nan                     20
   6           2      3000       nan                     39
   7           3      500        nan                     130
   8           4      9400       nan                     89
   9           5      909        nan                     7
   ...           ...                     ...

When the contract id repeats it should pull the same date, but instead they become NaNs.

How to fill nan values based on same , repeating contract id?

Upvotes: 1

Views: 195

Answers (1)

Ben.T
Ben.T

Reputation: 29635

you could use groupby.transform with first. if the altogether dataframe is called dfm, then:

# I assumed you put the dataframes together like this
dfm = pd.concat([df1, df2], axis=0)

col_exp = dfm.filter(like='expiry').columns
dfm[col_exp] = dfm.groupby('contract_id')[col_exp].transform('first')
print (dfm)
   contract_id  loan_sum  expiry_days_01_01_2019  expiry_days_01_02_2019
0            1      1000                    20.0                    20.0
1            2      2300                     3.0                    39.0
2            3       500                    17.0                   130.0
3            4      1400                     6.0                    89.0
4            5       890                     5.0                     7.0
0            1      7000                    20.0                    20.0
1            2      3000                     3.0                    39.0
2            3       500                    17.0                   130.0
3            4      9400                     6.0                    89.0
4            5       909                     5.0                     7.0

But I think changing the way the altogether dataframe is build could be another solution and create multiindex columns, something like:

list_dfs = [df1, df2]
dfm = pd.concat([df_.set_index('contract_id')
                    .rename(columns=lambda x: x.split('_')[0]) 
                 for df_ in list_dfs], 
                 keys=[df_.filter(like='expiry').columns[0][-10:]  
                       for df_ in list_dfs], 
                 axis=1)
print (dfm)
            01_01_2019        01_02_2019       
                  loan expiry       loan expiry
contract_id                                    
1                 1000     20       7000     20
2                 2300      3       3000     39
3                  500     17        500    130
4                 1400      6       9400     89
5                  890      5        909      7

Upvotes: 2

Related Questions