Reputation: 7644
i have a dataframe df1:
id age val
10 2 10
20 3 NaN
30 4 55
40 5 NaN
I have another dataframe df2:
age val_main
1 111
2 222
3 333
4 444
5 555
6 666
I want to only replace value from df1 where it is NaN with its corresponding value from df2.
Final output im looking for is:
id age val
10 2 10
20 3 333
30 4 55
40 5 555
i tried iterating the df1 by for loop and then locating the values from df2 where the row in df1 is null.
eg:
for index,row in df1.iterrows():
if row['val'].isnull():
df2.loc[df2.age==row.age].val
.....
.....
But i'm looking for a more robust and intelligent way of doing this.
Upvotes: 4
Views: 3800
Reputation: 862911
Use Series.map
by dictionary with Series.fillna
:
d = dict(zip(df2['age'], df2['val_main']))
#alternative with Series
#d = df2.set_index('age')['val_main']
df1['val'] = df1['val'].fillna(df1['age'].map(d))
print (df1)
id age val
0 10 2 10.0
1 20 3 333.0
2 30 4 55.0
3 40 5 555.0
Performance depends of number of rows in both dataFrames and also by number of unique values and number of missing values:
np.random.seed(123)
N = 1000000
df2 = pd.DataFrame({'age': np.arange(1,101),
'val_main':np.random.randint(1000, size=100)})
print (df2)
df1 = pd.DataFrame({'id': np.random.randint(1000, size=N),
'age':np.random.randint(100, size=N),
'val':np.random.choice([10,20,30,np.nan], size=N)})
print (df1)
In [229]: %%timeit
...: df = df1.merge(df2, on = 'age', how='left')
...: df['val'] = df.val.fillna(df.pop('val_main'))
...:
172 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [230]: %%timeit d = dict(zip(df2['age'], df2['val_main']))
...: #alternative with Series
...: #d = df2.set_index('age')['val_main']
...: df1['val'] = df1['val'].fillna(df1['age'].map(d))
...:
40 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 3
Reputation: 3770
i like pd.apply
too
df1
id age val
0 10 2 10.0
1 20 3 NaN
2 30 4 55.0
3 40 5 NaN
df2
age val_main
0 1 111
1 2 222
2 3 333
3 4 444
4 5 555
5 6 666
df1['val'] = df1[['age','val']].apply(lambda x: df2[df2['age'] == x.age].reset_index()['val_main'][0] if str(x.val) == 'nan' else x.val, axis=1 )
Output
id age val
0 10 2 10.0
1 20 3 333.0
2 30 4 55.0
3 40 5 555.0
Upvotes: 0
Reputation: 88246
You could left .merge
and .fillna
the values in val
with the merged column val_main
:
df = df1.merge(df2, on = 'age', how='left')
df['val'] = df.val.fillna(df.pop('val_main'))
Output
id age val
0 10 2 10.0
1 20 3 333.0
2 30 4 55.0
3 40 5 555.0
Upvotes: 2
Reputation: 11192
I think you are looking for this,
df1=pd.merge(df1,df2,on=['age'],how='left')
df1['val']=df1['val'].fillna(df1['val_main'])
df1.drop('val_main',1,inplace=True) #to remove unnecessary column
print (df1)
O/P:
id age val
0 10 2 10.0
1 20 3 333.0
2 30 4 55.0
3 40 5 555.0
Upvotes: 1