Reputation: 281
My dataframe has these 2 columns: "reach" and "height". the column "reach" has a lot of missing value. But the column 'height' have all the value needed. What I see is that reach is often a function of height. Therefore, for the rows with NaN, I want to look at the height, then find another row with the same "height" and that has "reach" available, then copy this value to the 1 with missing value
name | SApM | SLpM | height | reach | record | stance | strAcc | strDef | subAvg | tdAcc | tdAvg | tdDef | weight | born_year | win | lose | draw | nc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Justin Frazier | 6.11 | 1.11 | 6' 0" | 75 | 10-3-0 | Southpaw | 0.66 | 0.04 | 0 | 0 | 0 | 0 | 265 | 1989 | 10 | 3 | 0 | |
Gleidson Cutis | 8.28 | 2.99 | 5' 9" | nan | 7-4-0 | Orthodox | 0.52 | 0.59 | 0 | 0 | 0 | 0 | 155 | 1989 | 7 | 4 | 0 | |
Xavier Foupa-Pokam | 2.5 | 1.47 | 6' 1" | nan | 32-22-0 | Open Stance | 0.43 | 0.49 | 0 | 0 | 0 | 0.16 | 185 | 1982 | 32 | 22 | 0 | |
Mirko Filipovic | 1.89 | 2.11 | 6' 2" | 73 | 35-11-2-(1 NC) | Southpaw | 0.5 | 0.63 | 0.3 | 0.4 | 0.19 | 0.78 | 230 | 1974 | 35 | 11 | 2 | 1 |
Jordan Johnson | 2.64 | 3.45 | 6' 2" | 79 | 10-0-0 | Orthodox | 0.47 | 0.53 | 1.2 | 0.42 | 3.25 | 1 | 205 | 1988 | 10 | 0 | 0 | |
Martin Kampmann | 3.28 | 3.22 | 6' 0" | 72 | 20-7-0 | Orthodox | 0.42 | 0.62 | 2 | 0.41 | 1.86 | 0.78 | 170 | 1982 | 20 | 7 | 0 | |
Darren Elkins | 3.05 | 3.46 | 5' 10" | 71 | 27-9-0 | Orthodox | 0.38 | 0.52 | 1.1 | 0.33 | 2.67 | 0.56 | 145 | 1984 | 27 | 9 | 0 | |
Austen Lane | 6.32 | 5.26 | 6' 6" | nan | 2-1-0 | Orthodox | 0.35 | 0.6 | 0 | 0 | 0 | 0 | 245 | 1987 | 2 | 1 | 0 | |
Rachael Ostovich | 3.97 | 2.54 | 5' 3" | 62 | 4-6-0 | Orthodox | 0.43 | 0.57 | 0.8 | 0.83 | 2.03 | 0.66 | 125 | 1991 | 4 | 6 | 0 | |
Travis Lutter | 2.42 | 0.41 | 5' 11" | 75 | 10-6-0 | Orthodox | 0.32 | 0.42 | 0.7 | 0.24 | 1.95 | 0.3 | 185 | 1973 | 10 | 6 | 0 | |
Tom Murphy | 0.17 | 2.5 | 6' 2" | nan | 8-0-0 | Southpaw | 0.71 | 0.84 | 2.5 | 0.85 | 7.51 | 0 | 227 | 1974 | 8 | 0 | 0 | |
Darrell Montague | 5.38 | 1.92 | 5' 6" | 67 | 13-5-0 | Southpaw | 0.25 | 0.52 | 1.4 | 0.25 | 0.72 | 0.33 | 125 | 1987 | 13 | 5 | 0 | |
Lauren Murphy | 4.25 | 3.95 | 5' 5" | 67 | 15-4-0 | Orthodox | 0.4 | 0.61 | 0.1 | 0.34 | 1.16 | 0.7 | 125 | 1983 | 15 | 4 | 0 | |
Bill Mahood | 3.59 | 1.54 | 6' 3" | nan | 20-7-1-(1 NC) | Orthodox | 0.85 | 0.17 | 3.9 | 0 | 0 | 0 | 200 | 1967 | 20 | 7 | 1 | 1 |
Nate Marquardt | 2.32 | 2.71 | 6' 0" | 74 | 35-19-2 | Orthodox | 0.49 | 0.55 | 0.8 | 0.51 | 1.87 | 0.7 | 185 | 1979 | 35 | 19 | 2 | |
Mike Polchlopek | 1.33 | 2 | 6' 4" | nan | 1-1-0 | Orthodox | 0.38 | 0.57 | 0 | 0 | 0 | 0 | 285 | 1965 | 1 | 1 | 0 | |
Harvey Park | 7.21 | 3.77 | 6' 0" | 70 | 12-3-0 | Orthodox | 0.5 | 0.33 | 0 | 0 | 0 | 0 | 155 | 1986 | 12 | 3 | 0 | |
Junyong Park | 3.17 | 4.37 | 5' 10" | 73 | 13-4-0 | Orthodox | 0.47 | 0.58 | 0.6 | 0.57 | 3.02 | 0.46 | 185 | 1991 | 13 | 4 | 0 | |
Ricco Rodriguez | 1.15 | 1.85 | 6' 4" | nan | 53-25-0-(1 NC) | Orthodox | 0.51 | 0.61 | 1 | 0.39 | 2.3 | 0.4 | 265 | 1977 | 53 | 25 | 0 | 1 |
Aaron Riley | 3.78 | 3.45 | 5' 8" | 69 | 30-14-1 | Southpaw | 0.34 | 0.61 | 0.1 | 0.34 | 1.18 | 0.6 | 155 | 1980 | 30 | 14 | 1 |
Upvotes: 1
Views: 1528
Reputation: 23217
You can create a height reference dataframe with .groupby()
and fetch the first non-NaN entry of a height (if any) by .first()
, as follows:
height_ref = df.groupby('height')['reach'].first()
height
5' 10" 71.0
5' 11" 75.0
5' 3" 62.0
5' 5" 67.0
5' 6" 67.0
5' 8" 69.0
5' 9" NaN
6' 0" 75.0
6' 1" NaN
6' 2" 73.0
6' 3" NaN
6' 4" NaN
6' 6" NaN
Name: reach, dtype: float64
Then, you can fill up the NaN
values of column reach
by looking up the height reference dataframe by .map()
and use .fillna()
to fill-up values, as follows:
df['reach2'] = df['reach'].fillna(df['height'].map(height_ref))
For demo purpose, I update to a new column reach2
. You can overwrite the original column reach
as appropriate.
Result:
print(df[['height', 'reach', 'reach2']])
height reach reach2
0 6' 0" 75.0 75.0
1 5' 9" NaN NaN
2 6' 1" NaN NaN
3 6' 2" 73.0 73.0
4 6' 2" 79.0 79.0
5 6' 0" 72.0 72.0
6 5' 10" 71.0 71.0
7 6' 6" NaN NaN
8 5' 3" 62.0 62.0
9 5' 11" 75.0 75.0
10 6' 2" NaN 73.0 <======= filled up with referenced height from other row
11 5' 6" 67.0 67.0
12 5' 5" 67.0 67.0
13 6' 3" NaN NaN
14 6' 0" 74.0 74.0
15 6' 4" NaN NaN
16 6' 0" 70.0 70.0
17 5' 10" 73.0 73.0
18 6' 4" NaN NaN
19 5' 8" 69.0 69.0
Upvotes: 2
Reputation: 195
I think that a method does not exist to do that in a simple step. If i were in your shoes I would:
height|reach
fully populated, in which I would store my best guess valuesheight
as keyNaN
appears: df.reach = df.reach.fillna(df.from_support_dataset_height)
Upvotes: 1