sfactor

Reputation: 13062

Pandas: How to fill in null values from columns in another dataframe?

I have a dataframe in which some essential columns contain NaN values (I need these columns for further machine learning work). I have another dataframe with similar data, from which I want to pull in the missing values.

For example, df1 is the main dataframe:

id     col1    col2     col3     col4    col5
1      A       AA       100      5.0     0.9
2      A       BB       150      4.2     0.5
3      A       CC       100      NaN     NaN
4      B       AA       300      NaN     NaN
5      B       BB       100      NaN     NaN
6      C       BB       50       3.4     0.6

The second dataframe (df2), which I want to use to fill the NaN values in col4 and col5, could look like this:

id     col1    col3     col4    col5
100      A     100      4.5     1.0
101      A     100      3.5     0.8
103      B     300      5.0     0.5
105      B     300      5.5     0.8
106      B     100      5.3     0.2
107      C     100      3.0     1.2

So, I don't have col2 in the second dataframe, and there are duplicates in the col1 and col3 columns that I would merge on. Where there are duplicates, I have to choose the row with the maximum col4 value to fill the corresponding values in df1.

For example, the correct df1 after filling in the data would be:

id     col1    col2     col3     col4    col5
1      A       AA       100      5.0     0.9
2      A       BB       150      4.2     0.5
3      A       CC       100      4.5     1.0
4      B       AA       300      5.5     0.8
5      B       BB       100      5.3     0.2
6      C       BB       50       3.4     0.6

How would I do that?

Upvotes: 2

Views: 727

Answers (2)

unutbu

Reputation: 880329

import numpy as np
import pandas as pd
nan = np.nan

df1 = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B', 'C'], 'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'], 'col3': [100, 150, 100, 300, 100, 50], 'col4': [5.0, 4.2, nan, nan, nan, 3.4], 'col5': [0.9, 0.5, nan, nan, nan, 0.6], 'id': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'B', 'C'], 'col3': [100, 100, 300, 300, 100, 100], 'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0], 'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2], 'id': [100, 101, 103, 105, 106, 107]})

df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
df3 = pd.merge(df1[['col1','col3']], df2_max, 
               left_on=['col1','col3'], right_index=True, how='left')
result = df1.combine_first(df3)

yields

  col1 col2  col3  col4  col5  id
0    A   AA   100   5.0   0.9   1
1    A   BB   150   4.2   0.5   2
2    A   CC   100   4.5  99.0   3
3    B   AA   300   5.5   0.8   4
4    B   BB   100   5.3   0.2   5
5    C   BB    50   3.4   0.6   6

First, find the maximum of df2's col4 and col5 columns for each (col1, col3) pair:

df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
#            col4  col5
# col1 col3            
# A    100    4.5  99.0
# B    100    5.3   0.2
#      300    5.5   0.8
# C    100    3.0   1.2

Notice the 99 in the first row (instead of 0.8). I changed the example slightly, setting the second col5 value in df2 to 99, to show that the col4 maximum need not occur in the same row as the col5 maximum.
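
To make that difference concrete, here is a small sketch (not part of the original answer) that compares the column-wise maximum with the row that holds the largest col4 in each group. It reuses the modified df2 from above; the names per_column_max and row_of_max_col4 are made up for this illustration.

```python
import pandas as pd

df2 = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2],
    'id': [100, 101, 103, 105, 106, 107],
})

# Column-wise maximum: col4 and col5 are maximised independently per group.
per_column_max = df2.drop(columns='id').groupby(['col1', 'col3']).max()

# Row-wise pick: keep the whole row that holds the largest col4 per group.
idx = df2.groupby(['col1', 'col3'])['col4'].idxmax()
row_of_max_col4 = df2.loc[idx].drop(columns='id').set_index(['col1', 'col3'])

print(per_column_max.loc[('A', 100), 'col5'])   # 99.0
print(row_of_max_col4.loc[('A', 100), 'col5'])  # 1.0
```

If the goal is to take col5 from the same row as the largest col4, as the question's expected output suggests, the idxmax variant is the one to merge; the rest of the approach stays the same.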

Next, merge df1 and df2_max, on df1's col1 and col3 columns and df2_max's index:

df3 = pd.merge(df1[['col1','col3']], df2_max, 
               left_on=['col1','col3'], right_index=True, how='left')
#   col1  col3  col4  col5
# 0    A   100   4.5  99.0
# 1    A   150   NaN   NaN
# 2    A   100   4.5  99.0
# 3    B   300   5.5   0.8
# 4    B   100   5.3   0.2
# 5    C    50   NaN   NaN

This provides us with the maximums in col4 and col5 in a DataFrame whose index matches df1's index. This allows us to use df1.combine_first to fill in the NaNs with values from df3:

result = df1.combine_first(df3)
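
As a minimal illustration of what combine_first does here, the sketch below uses two toy frames (a and b, invented for this example). Values already present in the caller are kept, and only its NaN cells are filled from the other frame, matched by index label and column name.

```python
import numpy as np
import pandas as pd

# Toy frames: 'a' has a gap in column x, 'b' offers a value for every row.
a = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'tag': ['p', 'q', 'r']})
b = pd.DataFrame({'x': [10.0, 20.0, 30.0]})

# Only the NaN in row 1 is taken from b; rows 0 and 2 keep a's values.
filled = a.combine_first(b)
print(filled)
```

The columns of the result may come back in a different order than in the caller, so reorder them afterwards (for example with result[df1.columns]) if the original order matters.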

Upvotes: 1

piRSquared

Reputation: 294488

IIUC, you want the whole df2 row with the largest col4 in each (col1, col3) group:

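df1.combine_first(
    df1.merge(
        # keep the whole df2 row holding the largest col4 per (col1, col3) group
        df2.drop(columns='id').loc[df2.groupby(['col1', 'col3']).col4.idxmax()],
        on=['col1', 'col3'], how='left', suffixes=('_', '')
    )[['col4', 'col5']]  # the unsuffixed columns are the ones coming from df2
).reindex(columns=df1.columns)  # restore df1's column order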
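
The same idea can be unrolled into named steps, which may be easier to debug. This is only a sketch under a few assumptions: it uses the data from the question as posted, the intermediate names best and candidates are made up, and it swaps combine_first for fillna on just col4 and col5 so that df1's column order is left alone.

```python
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'],
    'col3': [100, 150, 100, 300, 100, 50],
    'col4': [5.0, 4.2, nan, nan, nan, 3.4],
    'col5': [0.9, 0.5, nan, nan, nan, 0.6],
})
df2 = pd.DataFrame({
    'id': [100, 101, 103, 105, 106, 107],
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 0.8, 0.5, 0.8, 0.2, 1.2],
})

# Step 1: one row per (col1, col3) pair, namely the row with the largest col4.
best = df2.loc[df2.groupby(['col1', 'col3'])['col4'].idxmax(),
               ['col1', 'col3', 'col4', 'col5']]

# Step 2: line the candidates up with df1's rows (a left merge keeps df1's order).
candidates = df1[['col1', 'col3']].merge(best, on=['col1', 'col3'], how='left')
candidates.index = df1.index  # merge resets the index, so restore df1's labels

# Step 3: fill only the gaps in col4 and col5; existing values are untouched.
result = df1.copy()
result[['col4', 'col5']] = df1[['col4', 'col5']].fillna(candidates[['col4', 'col5']])
print(result)
```

Rows of df1 with no matching (col1, col3) pair in df2 simply keep whatever they had, NaN included.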


Upvotes: 1
