k.ko3n
k.ko3n

Reputation: 954

Fill NaN with the mean of row from specific column more efficiently

Thank for making time to read my question.

I want to fill NaN in the following df with the average values from the columns which names started with 'A'.

For example, the first NaN should be filled with 2.5, which is the average of 2 and 3. The last NaN should be replaced with 1.5. Any values from column started with 'B' shall not be considered although they are in the same df.

A.1.a  A.3.d  A.6.i  B.2.b
NaN    2      3      12
1      2      3      12
1      NaN    3      12
1      2      3      12
NaN    2      3      12
1      2      NaN    12

Here is my attemp which is successful.

# read only columns which names started with A. 
cols_A = [col for col in df if col.startswith('A')]   
cols_A = df[cols_A]

cols_A = cols_A.apply(lambda row: row.fillna(row.mean()), axis=1)
cols_A

I am looking for a more efficient ways because my df has many more columns.

Upvotes: 0

Views: 319

Answers (3)

jezrael
jezrael

Reputation: 862681

Numpy solution for replace by only A starts columns:

#select only A starting columns
mask = df.columns.str.startswith('A')
df1 = df.loc[:, mask]
print (df1)
   A.1.a  A.3.d  A.6.i
0    NaN    2.0    3.0
1    1.0    2.0    3.0
2    1.0    NaN    3.0
3    1.0    2.0    3.0
4    NaN    2.0    3.0
5    1.0    2.0    NaN

#convert to 2d array
arr = df1.values
#broadcast to 2d array by df1 shape
a = np.broadcast_to(np.nanmean(arr, axis=1)[:, None], df1.shape)
#check missing values
m = np.isnan(arr)
#replace them by mask
arr[m] = a[m]
print (arr)
[[2.5 2.  3. ]
 [1.  2.  3. ]
 [1.  2.  3. ]
 [1.  2.  3. ]
 [2.5 2.  3. ]
 [1.  2.  1.5]]

#assign back
df.loc[:, mask] = arr
print (df)
   A.1.a  A.3.d  A.6.i  B.2.b
0    2.5    2.0    3.0     12
1    1.0    2.0    3.0     12
2    1.0    2.0    3.0     12
3    1.0    2.0    3.0     12
4    2.5    2.0    3.0     12
5    1.0    2.0    1.5     12

If need replace NaNs by groups defined by first value of columns:

df = df.combine_first(df.groupby(lambda x: x[0], axis=1).transform('mean'))
#alternative
#df = df.combine_first(df.groupby(df.columns.str[0], axis=1).transform('mean'))
print (df)
   A.1.a  A.3.d  A.6.i  B.2.b
0    2.5    2.0    3.0     12
1    1.0    2.0    3.0     12
2    1.0    2.0    3.0     12
3    1.0    2.0    3.0     12
4    2.5    2.0    3.0     12
5    1.0    2.0    1.5     12

Another idea is create dictionary of Series and replace by DataFrame.fillna:

df1 = df.groupby(df.columns.str[0], axis=1).mean()

df = df.fillna({x: df1[x[0]] for x in df.columns})
print (df)
   A.1.a  A.3.d  A.6.i  B.2.b
0    2.5    2.0    3.0     12
1    1.0    2.0    3.0     12
2    1.0    2.0    3.0     12
3    1.0    2.0    3.0     12
4    2.5    2.0    3.0     12
5    1.0    2.0    1.5     12

Upvotes: 1

Joe
Joe

Reputation: 12417

Another option is:

cols_A = cols_A.T.fillna(cols_A.mean(axis=1)).T

Output:

   A.1.a  A.3.d  A.6.i
0    2.5    2.0    3.0
1    1.0    2.0    3.0
2    1.0    2.0    3.0
3    1.0    2.0    3.0
4    2.5    2.0    3.0
5    1.0    2.0    1.5

Upvotes: 0

rafaelc
rafaelc

Reputation: 59274

IIUC, you can try broadcasting and fillna

cols = [x for x in df.columns if x.startswith('A')]
df.fillna(pd.DataFrame((df[cols].sum(1)/df[cols].notnull().sum(1)).values[:,None] * np.ones([len(cols),1]).T, columns=cols))

    A.1.a   A.3.d   A.6.i   B.2.b
0   2.5     2.0     3.0     12
1   1.0     2.0     3.0     12
2   1.0     2.0     3.0     12
3   1.0     2.0     3.0     12
4   2.5     2.0     3.0     12
5   1.0     2.0     1.5     12

Good timings

df = pd.concat([df]*1000).reset_index(drop=True)

%timeit df.fillna(pd.DataFrame(df[cols].sum(1).div(df[cols].notnull().sum(1)).values[:,None] * np.ones([len(cols),1]).T, columns=cols))
5.73 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.combine_first(df.groupby(lambda x: x[0], axis=1).transform('mean'))
856 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 1

Related Questions