Reputation: 13062
I have a dataframe in which certain essential columns (which I need for further machine learning work) contain NULL values. I have another dataframe with similar data from which I want to pull in the missing values.
For example, df1 is the main dataframe:
id col1 col2 col3 col4 col5
1 A AA 100 5.0 0.9
2 A BB 150 4.2 0.5
3 A CC 100 NaN NaN
4 B AA 300 NaN NaN
5 B BB 100 NaN NaN
6 C BB 50 3.4 0.6
The dataframe I want to use to fill the NaN values in col4 and col5 could look like this:
id col1 col3 col4 col5
100 A 100 4.5 1.0
101 A 100 3.5 0.8
103 B 300 5.0 0.5
105 B 300 5.5 0.8
106 B 100 5.3 0.2
107 C 100 3.0 1.2
So, I don't have col2 in the second df, and there are duplicates for the col1 and col3 columns that I can merge by. So, I have to choose the row with the maximum col4 value to fill the corresponding values in df1.
For example, the correct value for df1 after filling in the data would be:
id col1 col2 col3 col4 col5
1 A AA 100 5.0 0.9
2 A BB 150 4.2 0.5
3 A CC 100 4.5 1.0
4 B AA 300 5.5 0.8
5 B BB 100 5.3 0.2
6 C BB 50 3.4 0.6
How would I do that?
Upvotes: 2
Views: 727
Reputation: 880329
import numpy as np
import pandas as pd
nan = np.nan
df1 = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B', 'C'], 'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'], 'col3': [100, 150, 100, 300, 100, 50], 'col4': [5.0, 4.2, nan, nan, nan, 3.4], 'col5': [0.9, 0.5, nan, nan, nan, 0.6], 'id': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'B', 'C'], 'col3': [100, 100, 300, 300, 100, 100], 'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0], 'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2], 'id': [100, 101, 103, 105, 106, 107]})
df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
df3 = pd.merge(df1[['col1','col3']], df2_max,
               left_on=['col1','col3'], right_index=True, how='left')
result = df1.combine_first(df3)
yields
col1 col2 col3 col4 col5 id
0 A AA 100 5.0 0.9 1
1 A BB 150 4.2 0.5 2
2 A CC 100 4.5 99.0 3
3 B AA 300 5.5 0.8 4
4 B BB 100 5.3 0.2 5
5 C BB 50 3.4 0.6 6
First, find the max of df2's col4 and col5 columns for each (col1, col3) pair:
df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
#            col4  col5
# col1 col3
# A    100    4.5  99.0
# B    100    5.3   0.2
#      300    5.5   0.8
# C    100    3.0   1.2
Notice the 99 in the first row. I changed the example slightly (the col5 value for id 101 is 99 instead of 0.8) to show that the col4 maximum need not occur in the same row as the col5 maximum.
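If, as the question describes, col5 should come from the same row that holds the col4 maximum, one possible variation is to keep a single row per group rather than taking each column's maximum independently. The following is only a sketch under that assumption; `df2_best` is a hypothetical name, not something from the answer above.

```python
import pandas as pd

df2 = pd.DataFrame({
    'id': [100, 101, 103, 105, 106, 107],
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2],
})

# Sort so the largest col4 comes first, then keep one row per (col1, col3).
# Each surviving row carries the col5 value that sat next to the col4 maximum.
df2_best = (
    df2.drop(columns='id')
       .sort_values('col4', ascending=False)
       .drop_duplicates(['col1', 'col3'])
       .set_index(['col1', 'col3'])
       .sort_index()
)
```

With this data, the (A, 100) group keeps col4 = 4.5 together with col5 = 1.0 rather than 99. Because `df2_best` has the same index layout as df2_max, it should be usable in its place in the merge step.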
Next, merge df1 and df2_max on df1's col1,col3 columns and df2_max's index:
df3 = pd.merge(df1[['col1','col3']], df2_max,
               left_on=['col1','col3'], right_index=True, how='left')
# col1 col3 col4 col5
# 0 A 100 4.5 99.0
# 1 A 150 NaN NaN
# 2 A 100 4.5 99.0
# 3 B 300 5.5 0.8
# 4 B 100 5.3 0.2
# 5 C 50 NaN NaN
This provides us with the maximums in col4 and col5 in a DataFrame whose index matches df1's index. This allows us to use df1.combine_first to fill in the NaNs with values from df3:
result = df1.combine_first(df3)
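To make the role of the index concrete, here is a small sketch of how combine_first behaves on two toy frames. The names `left`, `right` and `filled` are made up for illustration. Values are matched by index label and column name, and only the NaNs in the calling frame are replaced.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(
    {'col4': [5.0, np.nan, np.nan], 'col5': [0.9, np.nan, 0.3]},
    index=[0, 1, 2],
)
# Same labels as `left`, but stored in a different order.
right = pd.DataFrame(
    {'col4': [5.5, 4.5, 1.0], 'col5': [0.8, 1.0, 7.0]},
    index=[2, 1, 0],
)

# Non-NaN values in `left` win; gaps are filled from `right` by label.
filled = left.combine_first(right)
```

Row 0 keeps its own 5.0 and 0.9 even though `right` has values for that label, while the NaNs in rows 1 and 2 are taken from the rows of `right` with the same labels.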
Upvotes: 1
Reputation: 294488
IIUC
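df1.combine_first(
    df1.merge(
        # for each (col1, col3), keep the df2 row with the largest col4
        df2.drop(columns='id').loc[df2.groupby(['col1', 'col3'])['col4'].idxmax()],
        # suffix df1's own col4/col5 so the selected ones come from df2
        on=['col1', 'col3'], how='left', suffixes=('_', '')
    )[['col4', 'col5']]
).reindex(columns=df1.columns)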
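One caveat with this approach: a column-on-column merge returns a fresh 0..n-1 index, so the combine_first step lines up with df1 only when df1 also has a default index. Below is a hedged sketch of one way to handle a df1 with other index labels, using cut-down data and the hypothetical names `best` and `fill`. It assumes the (col1, col3) keys on the df2 side are unique after the idxmax selection, so the merge keeps one row per df1 row.

```python
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame({
    'id': [1, 2, 3], 'col1': ['A', 'A', 'B'], 'col3': [100, 150, 300],
    'col4': [5.0, nan, nan], 'col5': [0.9, nan, nan],
}, index=[10, 20, 30])  # deliberately not 0..n-1
df2 = pd.DataFrame({
    'id': [100, 101, 103, 105], 'col1': ['A', 'A', 'B', 'B'],
    'col3': [100, 100, 300, 300],
    'col4': [4.5, 3.5, 5.0, 5.5], 'col5': [1.0, 0.8, 0.5, 0.8],
})

# One row per (col1, col3): the row holding the largest col4.
best = df2.drop(columns='id').loc[df2.groupby(['col1', 'col3'])['col4'].idxmax()]
fill = df1.merge(best, on=['col1', 'col3'], how='left',
                 suffixes=('_', ''))[['col4', 'col5']]
fill.index = df1.index  # restore df1's labels so combine_first aligns rows
result = df1.combine_first(fill).reindex(columns=df1.columns)
```

The row with no match in df2 (A, 150) simply stays NaN.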
Upvotes: 1