le Minh Nguyen
le Minh Nguyen

Reputation: 281

fill NaN with values of a similar row

My dataframe has these 2 columns: "reach" and "height". the column "reach" has a lot of missing value. But the column 'height' have all the value needed. What I see is that reach is often a function of height. Therefore, for the rows with NaN, I want to look at the height, then find another row with the same "height" and that has "reach" available, then copy this value to the 1 with missing value

name SApM SLpM height reach record stance strAcc strDef subAvg tdAcc tdAvg tdDef weight born_year win lose draw nc
Justin Frazier 6.11 1.11 6' 0" 75 10-3-0 Southpaw 0.66 0.04 0 0 0 0 265 1989 10 3 0
Gleidson Cutis 8.28 2.99 5' 9" nan 7-4-0 Orthodox 0.52 0.59 0 0 0 0 155 1989 7 4 0
Xavier Foupa-Pokam 2.5 1.47 6' 1" nan 32-22-0 Open Stance 0.43 0.49 0 0 0 0.16 185 1982 32 22 0
Mirko Filipovic 1.89 2.11 6' 2" 73 35-11-2-(1 NC) Southpaw 0.5 0.63 0.3 0.4 0.19 0.78 230 1974 35 11 2 1
Jordan Johnson 2.64 3.45 6' 2" 79 10-0-0 Orthodox 0.47 0.53 1.2 0.42 3.25 1 205 1988 10 0 0
Martin Kampmann 3.28 3.22 6' 0" 72 20-7-0 Orthodox 0.42 0.62 2 0.41 1.86 0.78 170 1982 20 7 0
Darren Elkins 3.05 3.46 5' 10" 71 27-9-0 Orthodox 0.38 0.52 1.1 0.33 2.67 0.56 145 1984 27 9 0
Austen Lane 6.32 5.26 6' 6" nan 2-1-0 Orthodox 0.35 0.6 0 0 0 0 245 1987 2 1 0
Rachael Ostovich 3.97 2.54 5' 3" 62 4-6-0 Orthodox 0.43 0.57 0.8 0.83 2.03 0.66 125 1991 4 6 0
Travis Lutter 2.42 0.41 5' 11" 75 10-6-0 Orthodox 0.32 0.42 0.7 0.24 1.95 0.3 185 1973 10 6 0
Tom Murphy 0.17 2.5 6' 2" nan 8-0-0 Southpaw 0.71 0.84 2.5 0.85 7.51 0 227 1974 8 0 0
Darrell Montague 5.38 1.92 5' 6" 67 13-5-0 Southpaw 0.25 0.52 1.4 0.25 0.72 0.33 125 1987 13 5 0
Lauren Murphy 4.25 3.95 5' 5" 67 15-4-0 Orthodox 0.4 0.61 0.1 0.34 1.16 0.7 125 1983 15 4 0
Bill Mahood 3.59 1.54 6' 3" nan 20-7-1-(1 NC) Orthodox 0.85 0.17 3.9 0 0 0 200 1967 20 7 1 1
Nate Marquardt 2.32 2.71 6' 0" 74 35-19-2 Orthodox 0.49 0.55 0.8 0.51 1.87 0.7 185 1979 35 19 2
Mike Polchlopek 1.33 2 6' 4" nan 1-1-0 Orthodox 0.38 0.57 0 0 0 0 285 1965 1 1 0
Harvey Park 7.21 3.77 6' 0" 70 12-3-0 Orthodox 0.5 0.33 0 0 0 0 155 1986 12 3 0
Junyong Park 3.17 4.37 5' 10" 73 13-4-0 Orthodox 0.47 0.58 0.6 0.57 3.02 0.46 185 1991 13 4 0
Ricco Rodriguez 1.15 1.85 6' 4" nan 53-25-0-(1 NC) Orthodox 0.51 0.61 1 0.39 2.3 0.4 265 1977 53 25 0 1
Aaron Riley 3.78 3.45 5' 8" 69 30-14-1 Southpaw 0.34 0.61 0.1 0.34 1.18 0.6 155 1980 30 14 1

Upvotes: 1

Views: 1528

Answers (2)

SeaBean
SeaBean

Reputation: 23217

You can create a height reference dataframe with .groupby() and fetch the first non-NaN entry of a height (if any) by .first(), as follows:

height_ref = df.groupby('height')['reach'].first()


height
5' 10"    71.0
5' 11"    75.0
5' 3"     62.0
5' 5"     67.0
5' 6"     67.0
5' 8"     69.0
5' 9"      NaN
6' 0"     75.0
6' 1"      NaN
6' 2"     73.0
6' 3"      NaN
6' 4"      NaN
6' 6"      NaN
Name: reach, dtype: float64

Then, you can fill up the NaN values of column reach by looking up the height reference dataframe by .map() and use .fillna() to fill-up values, as follows:

df['reach2'] = df['reach'].fillna(df['height'].map(height_ref))

For demo purpose, I update to a new column reach2. You can overwrite the original column reach as appropriate.

Result:

print(df[['height', 'reach', 'reach2']])


    height  reach  reach2
0    6' 0"   75.0    75.0
1    5' 9"    NaN     NaN
2    6' 1"    NaN     NaN
3    6' 2"   73.0    73.0
4    6' 2"   79.0    79.0
5    6' 0"   72.0    72.0
6   5' 10"   71.0    71.0
7    6' 6"    NaN     NaN
8    5' 3"   62.0    62.0
9   5' 11"   75.0    75.0
10   6' 2"    NaN    73.0      <=======  filled up with referenced height from other row
11   5' 6"   67.0    67.0
12   5' 5"   67.0    67.0
13   6' 3"    NaN     NaN
14   6' 0"   74.0    74.0
15   6' 4"    NaN     NaN
16   6' 0"   70.0    70.0
17  5' 10"   73.0    73.0
18   6' 4"    NaN     NaN
19   5' 8"   69.0    69.0

Upvotes: 2

Mirco
Mirco

Reputation: 195

I think that a method does not exist to do that in a simple step. If i were in your shoes I would:

  1. Create a support dataset made up of height|reach fully populated, in which I would store my best guess values
  2. Join the support dataframe with the existing ones, using height as key
  3. Coalesce the values where NaN appears: df.reach = df.reach.fillna(df.from_support_dataset_height)

Upvotes: 1

Related Questions