Reputation: 145
I have a pandas DataFrame with two columns: locationid and geo_loc. The locationid column has missing values.
For each row with a missing locationid, I want to take its geo_loc value, find another row with the same geo_loc, and use that row's locationid.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'locationid': [111, np.nan, 145, np.nan, 189, np.nan, 158, 145],
                    'geo_loc': ['G12', 'K11', 'B16', 'G12', 'B22', 'B16', 'K11', 'B16']})
df1
I need the final output like this:
For example, locationid is missing at index 1, where geo_loc is 'K11'. Looking up 'K11' in the geo_loc column finds index 6, which has locationid 158, so I want to fill index 1 with 158.
I tried the following, and neither attempt worked:
df1['locationid'] = df1.locationid.fillna(df1.groupby('geo_loc')['locationid'].max())
df1['locationid'] = df1.locationid.fillna(df1.groupby('geo_loc').apply(lambda x: print(list(x.locationid)[0])))
Upvotes: 3
Views: 119
Reputation: 862641
Use GroupBy.transform with 'max' to get a Series the same size as the original, filled with each group's maximum, and pass it to fillna:
df1['locationid'] = df1['locationid'].fillna(df1.groupby('geo_loc')['locationid'].transform('max'))
print(df1)
   locationid geo_loc
0       111.0     G12
1       158.0     K11
2       145.0     B16
3       111.0     G12
4       189.0     B22
5       145.0     B16
6       158.0     K11
7       145.0     B16
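As a rough sketch of why the first attempt in the question leaves the column unchanged: fillna aligns a Series argument on the index, and the plain groupby(...).max() result is indexed by geo_loc labels rather than by row labels. The names per_group, unchanged and filled below are made up for illustration, and mapping geo_loc through the aggregated Series is an alternative to transform, not something from the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'locationid': [111, np.nan, 145, np.nan, 189, np.nan, 158, 145],
    'geo_loc': ['G12', 'K11', 'B16', 'G12', 'B22', 'B16', 'K11', 'B16'],
})

# One value per group, indexed by the geo_loc labels ('B16', 'B22', ...).
per_group = df1.groupby('geo_loc')['locationid'].max()

# fillna aligns on the index: row labels 0..7 never match the geo_loc
# labels, so no missing value is filled here.
unchanged = df1['locationid'].fillna(per_group)

# Mapping each row's geo_loc through the lookup gives a Series aligned
# with the rows, which fillna can then use.
filled = df1['locationid'].fillna(df1['geo_loc'].map(per_group))
```

With transform('max') the alignment step is handled for you, which is why the one-liner above is shorter.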
If the values are strings, the same result is possible with a trick: remove the missing values with Series.dropna inside a lambda function before taking the max. Strings are compared lexicographically:
df1 = pd.DataFrame({'locationid': [111, np.nan, 145, np.nan, 189, np.nan, 158, 145],
                    'geo_loc': ['G12', 'K11', 'B16', 'G12', 'B22', 'B16', 'K11', 'B16']})
# sample data: strings with missing values
df1['locationid'] = df1['locationid'].dropna().astype(str) + 'a'
df1['locationid'] = (df1.groupby('geo_loc')['locationid']
                        .transform(lambda x: x.fillna(x.dropna().max())))
print(df1)
  locationid geo_loc
0     111.0a     G12
1     158.0a     K11
2     145.0a     B16
3     111.0a     G12
4     189.0a     B22
5     145.0a     B16
6     158.0a     K11
7     145.0a     B16
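A possible variation for string IDs, offered as a sketch rather than as part of the answer above: the 'first' aggregation returns the first non-missing value in each group, so it avoids the lambda. The frame df2 and its 'x...' IDs are invented sample data, and first_known is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: string IDs with missing values.
df2 = pd.DataFrame({
    'locationid': ['x111', np.nan, 'x145', np.nan, 'x189', np.nan, 'x158', 'x145'],
    'geo_loc': ['G12', 'K11', 'B16', 'G12', 'B22', 'B16', 'K11', 'B16'],
})

# 'first' picks the first non-missing locationid in each geo_loc group,
# and transform broadcasts it back to every row of that group.
first_known = df2.groupby('geo_loc')['locationid'].transform('first')

# Only the missing entries are replaced; existing IDs are kept.
df2['locationid'] = df2['locationid'].fillna(first_known)
```

Note that 'first' and max can pick different IDs when a group contains more than one distinct value, and a group with no known ID at all stays missing either way.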
Upvotes: 2