Reputation: 954
Thank for making time to read my question.
I want to fill NaN in the following df with the average values from the columns which names started with 'A'.
For example, the first NaN should be filled with 2.5, which is the average of 2 and 3. The last NaN should be replaced with 1.5. Any values from column started with 'B' shall not be considered although they are in the same df.
A.1.a A.3.d A.6.i B.2.b
NaN 2 3 12
1 2 3 12
1 NaN 3 12
1 2 3 12
NaN 2 3 12
1 2 NaN 12
Here is my attemp which is successful.
# read only columns which names started with A.
cols_A = [col for col in df if col.startswith('A')]
cols_A = df[cols_A]
cols_A = cols_A.apply(lambda row: row.fillna(row.mean()), axis=1)
cols_A
I am looking for a more efficient ways because my df has many more columns.
Upvotes: 0
Views: 319
Reputation: 862681
Numpy solution for replace by only A
starts columns:
#select only A starting columns
mask = df.columns.str.startswith('A')
df1 = df.loc[:, mask]
print (df1)
A.1.a A.3.d A.6.i
0 NaN 2.0 3.0
1 1.0 2.0 3.0
2 1.0 NaN 3.0
3 1.0 2.0 3.0
4 NaN 2.0 3.0
5 1.0 2.0 NaN
#convert to 2d array
arr = df1.values
#broadcast to 2d array by df1 shape
a = np.broadcast_to(np.nanmean(arr, axis=1)[:, None], df1.shape)
#check missing values
m = np.isnan(arr)
#replace them by mask
arr[m] = a[m]
print (arr)
[[2.5 2. 3. ]
[1. 2. 3. ]
[1. 2. 3. ]
[1. 2. 3. ]
[2.5 2. 3. ]
[1. 2. 1.5]]
#assign back
df.loc[:, mask] = arr
print (df)
A.1.a A.3.d A.6.i B.2.b
0 2.5 2.0 3.0 12
1 1.0 2.0 3.0 12
2 1.0 2.0 3.0 12
3 1.0 2.0 3.0 12
4 2.5 2.0 3.0 12
5 1.0 2.0 1.5 12
If need replace NaN
s by groups defined by first value of columns:
df = df.combine_first(df.groupby(lambda x: x[0], axis=1).transform('mean'))
#alternative
#df = df.combine_first(df.groupby(df.columns.str[0], axis=1).transform('mean'))
print (df)
A.1.a A.3.d A.6.i B.2.b
0 2.5 2.0 3.0 12
1 1.0 2.0 3.0 12
2 1.0 2.0 3.0 12
3 1.0 2.0 3.0 12
4 2.5 2.0 3.0 12
5 1.0 2.0 1.5 12
Another idea is create dictionary of Series
and replace by DataFrame.fillna
:
df1 = df.groupby(df.columns.str[0], axis=1).mean()
df = df.fillna({x: df1[x[0]] for x in df.columns})
print (df)
A.1.a A.3.d A.6.i B.2.b
0 2.5 2.0 3.0 12
1 1.0 2.0 3.0 12
2 1.0 2.0 3.0 12
3 1.0 2.0 3.0 12
4 2.5 2.0 3.0 12
5 1.0 2.0 1.5 12
Upvotes: 1
Reputation: 12417
Another option is:
cols_A = cols_A.T.fillna(cols_A.mean(axis=1)).T
Output:
A.1.a A.3.d A.6.i
0 2.5 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 2.5 2.0 3.0
5 1.0 2.0 1.5
Upvotes: 0
Reputation: 59274
IIUC, you can try broadcasting
and fillna
cols = [x for x in df.columns if x.startswith('A')]
df.fillna(pd.DataFrame((df[cols].sum(1)/df[cols].notnull().sum(1)).values[:,None] * np.ones([len(cols),1]).T, columns=cols))
A.1.a A.3.d A.6.i B.2.b
0 2.5 2.0 3.0 12
1 1.0 2.0 3.0 12
2 1.0 2.0 3.0 12
3 1.0 2.0 3.0 12
4 2.5 2.0 3.0 12
5 1.0 2.0 1.5 12
Good timings
df = pd.concat([df]*1000).reset_index(drop=True)
%timeit df.fillna(pd.DataFrame(df[cols].sum(1).div(df[cols].notnull().sum(1)).values[:,None] * np.ones([len(cols),1]).T, columns=cols))
5.73 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.combine_first(df.groupby(lambda x: x[0], axis=1).transform('mean'))
856 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 1