Shubham R
Shubham R

Reputation: 7644

Change values of a dataframe if value is null, from another dataframe pandas

i have a dataframe df1:

id   age   val
10    2     10
20    3     NaN
30    4     55
40    5     NaN   

I have another dataframe df2:

age  val_main
 1    111
 2    222
 3    333
 4    444
 5    555
 6    666

I want to only replace value from df1 where it is NaN with its corresponding value from df2.

Final output im looking for is:

id   age   val
10    2     10
20    3     333
30    4     55
40    5     555   

i tried iterating the df1 by for loop and then locating the values from df2 where the row in df1 is null.

eg:

for index,row in df1.iterrows():
    if row['val'].isnull():
        df2.loc[df2.age==row.age].val
        .....
        .....

But i'm looking for a more robust and intelligent way of doing this.

Upvotes: 4

Views: 3800

Answers (4)

jezrael
jezrael

Reputation: 862911

Use Series.map by dictionary with Series.fillna:

d = dict(zip(df2['age'], df2['val_main']))
#alternative with Series
#d = df2.set_index('age')['val_main']
df1['val'] = df1['val'].fillna(df1['age'].map(d))
print (df1)
   id  age    val
0  10    2   10.0
1  20    3  333.0
2  30    4   55.0
3  40    5  555.0

Performance depends of number of rows in both dataFrames and also by number of unique values and number of missing values:

np.random.seed(123)
N = 1000000

df2 = pd.DataFrame({'age': np.arange(1,101),
                    'val_main':np.random.randint(1000, size=100)})
print (df2)

df1 = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'age':np.random.randint(100, size=N),
                   'val':np.random.choice([10,20,30,np.nan], size=N)})
print (df1)

In [229]: %%timeit
     ...: df = df1.merge(df2, on = 'age', how='left')
     ...: df['val'] = df.val.fillna(df.pop('val_main'))
     ...: 
172 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [230]: %%timeit d = dict(zip(df2['age'], df2['val_main']))
     ...: #alternative with Series
     ...: #d = df2.set_index('age')['val_main']
     ...: df1['val'] = df1['val'].fillna(df1['age'].map(d))
     ...: 
40 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 3

iamklaus
iamklaus

Reputation: 3770

i like pd.apply too

df1

   id  age   val
0  10    2  10.0
1  20    3   NaN
2  30    4  55.0
3  40    5   NaN

df2

   age  val_main
0    1       111
1    2       222
2    3       333
3    4       444
4    5       555
5    6       666

df1['val'] = df1[['age','val']].apply(lambda x: df2[df2['age'] == x.age].reset_index()['val_main'][0] if str(x.val) == 'nan' else x.val, axis=1 )

Output

   id  age    val
0  10    2   10.0
1  20    3  333.0
2  30    4   55.0
3  40    5  555.0

Upvotes: 0

yatu
yatu

Reputation: 88246

You could left .merge and .fillna the values in val with the merged column val_main:

df = df1.merge(df2, on = 'age', how='left')
df['val'] = df.val.fillna(df.pop('val_main'))

Output

   id  age    val
0  10    2   10.0
1  20    3  333.0
2  30    4   55.0
3  40    5  555.0

Upvotes: 2

Mohamed Thasin ah
Mohamed Thasin ah

Reputation: 11192

I think you are looking for this,

df1=pd.merge(df1,df2,on=['age'],how='left')
df1['val']=df1['val'].fillna(df1['val_main'])

df1.drop('val_main',1,inplace=True) #to remove unnecessary column

print (df1)

O/P:

   id  age    val 
0  10    2   10.0 
1  20    3  333.0 
2  30    4   55.0 
3  40    5  555.0 

Upvotes: 1

Related Questions