fill NaN with values of a similar row

Question

My dataframe has these 2 columns: "reach" and "height". the column "reach" has a lot of missing value. But the column 'height' have all the value needed. What I see is that reach is often a function of height. Therefore, for the rows with NaN, I want to look at the height, then find another row with the same "height" and that has "reach" available, then copy this value to the 1 with missing value

name	SApM	SLpM	height	reach	record	stance	strAcc	strDef	subAvg	tdAcc	tdAvg	tdDef	weight	born_year	win	lose	draw	nc
Justin Frazier	6.11	1.11	6' 0"	75	10-3-0	Southpaw	0.66	0.04	0	0	0	0	265	1989	10	3	0
Gleidson Cutis	8.28	2.99	5' 9"	nan	7-4-0	Orthodox	0.52	0.59	0	0	0	0	155	1989	7	4	0
Xavier Foupa-Pokam	2.5	1.47	6' 1"	nan	32-22-0	Open Stance	0.43	0.49	0	0	0	0.16	185	1982	32	22	0
Mirko Filipovic	1.89	2.11	6' 2"	73	35-11-2-(1 NC)	Southpaw	0.5	0.63	0.3	0.4	0.19	0.78	230	1974	35	11	2	1
Jordan Johnson	2.64	3.45	6' 2"	79	10-0-0	Orthodox	0.47	0.53	1.2	0.42	3.25	1	205	1988	10	0	0
Martin Kampmann	3.28	3.22	6' 0"	72	20-7-0	Orthodox	0.42	0.62	2	0.41	1.86	0.78	170	1982	20	7	0
Darren Elkins	3.05	3.46	5' 10"	71	27-9-0	Orthodox	0.38	0.52	1.1	0.33	2.67	0.56	145	1984	27	9	0
Austen Lane	6.32	5.26	6' 6"	nan	2-1-0	Orthodox	0.35	0.6	0	0	0	0	245	1987	2	1	0
Rachael Ostovich	3.97	2.54	5' 3"	62	4-6-0	Orthodox	0.43	0.57	0.8	0.83	2.03	0.66	125	1991	4	6	0
Travis Lutter	2.42	0.41	5' 11"	75	10-6-0	Orthodox	0.32	0.42	0.7	0.24	1.95	0.3	185	1973	10	6	0
Tom Murphy	0.17	2.5	6' 2"	nan	8-0-0	Southpaw	0.71	0.84	2.5	0.85	7.51	0	227	1974	8	0	0
Darrell Montague	5.38	1.92	5' 6"	67	13-5-0	Southpaw	0.25	0.52	1.4	0.25	0.72	0.33	125	1987	13	5	0
Lauren Murphy	4.25	3.95	5' 5"	67	15-4-0	Orthodox	0.4	0.61	0.1	0.34	1.16	0.7	125	1983	15	4	0
Bill Mahood	3.59	1.54	6' 3"	nan	20-7-1-(1 NC)	Orthodox	0.85	0.17	3.9	0	0	0	200	1967	20	7	1	1
Nate Marquardt	2.32	2.71	6' 0"	74	35-19-2	Orthodox	0.49	0.55	0.8	0.51	1.87	0.7	185	1979	35	19	2
Mike Polchlopek	1.33	2	6' 4"	nan	1-1-0	Orthodox	0.38	0.57	0	0	0	0	285	1965	1	1	0
Harvey Park	7.21	3.77	6' 0"	70	12-3-0	Orthodox	0.5	0.33	0	0	0	0	155	1986	12	3	0
Junyong Park	3.17	4.37	5' 10"	73	13-4-0	Orthodox	0.47	0.58	0.6	0.57	3.02	0.46	185	1991	13	4	0
Ricco Rodriguez	1.15	1.85	6' 4"	nan	53-25-0-(1 NC)	Orthodox	0.51	0.61	1	0.39	2.3	0.4	265	1977	53	25	0	1
Aaron Riley	3.78	3.45	5' 8"	69	30-14-1	Southpaw	0.34	0.61	0.1	0.34	1.18	0.6	155	1980	30	14	1

SeaBean · Accepted Answer

You can create a height reference dataframe with .groupby() and fetch the first non-NaN entry of a height (if any) by .first(), as follows:

height_ref = df.groupby('height')['reach'].first()


height
5' 10"    71.0
5' 11"    75.0
5' 3"     62.0
5' 5"     67.0
5' 6"     67.0
5' 8"     69.0
5' 9"      NaN
6' 0"     75.0
6' 1"      NaN
6' 2"     73.0
6' 3"      NaN
6' 4"      NaN
6' 6"      NaN
Name: reach, dtype: float64

Then, you can fill up the NaN values of column reach by looking up the height reference dataframe by .map() and use .fillna() to fill-up values, as follows:

df['reach2'] = df['reach'].fillna(df['height'].map(height_ref))

For demo purpose, I update to a new column reach2. You can overwrite the original column reach as appropriate.

Result:

print(df[['height', 'reach', 'reach2']])


    height  reach  reach2
0    6' 0"   75.0    75.0
1    5' 9"    NaN     NaN
2    6' 1"    NaN     NaN
3    6' 2"   73.0    73.0
4    6' 2"   79.0    79.0
5    6' 0"   72.0    72.0
6   5' 10"   71.0    71.0
7    6' 6"    NaN     NaN
8    5' 3"   62.0    62.0
9   5' 11"   75.0    75.0
10   6' 2"    NaN    73.0      <=======  filled up with referenced height from other row
11   5' 6"   67.0    67.0
12   5' 5"   67.0    67.0
13   6' 3"    NaN     NaN
14   6' 0"   74.0    74.0
15   6' 4"    NaN     NaN
16   6' 0"   70.0    70.0
17  5' 10"   73.0    73.0
18   6' 4"    NaN     NaN
19   5' 8"   69.0    69.0

fill NaN with values of a similar row

Answers (2)

Related Questions