Reputation: 917
I am using the sklearn.preprocessing.Imputer class to impute NaN values using a mean strategy over the columns, i.e. axis=0. My problem is that some of the data to be imputed has only NaN values in its column, e.g. when there is only a single entry.
import numpy as np
from sklearn.preprocessing import Imputer
data = np.array([[1, 2, np.nan]])
data = Imputer().fit_transform(data)
This gives an output of array([[1., 2.]])
Fair enough, obviously the Imputer cannot compute a mean for a set of values which are all NaN. However, instead of removing the value I would like to fall back to a default value, in my case 0.
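To illustrate why no mean is available here, below is a minimal NumPy-only sketch. It is not the Imputer's actual internals, just the same column-wise mean computed with np.nanmean, and the name column_means is made up for this example.

```python
import warnings

import numpy as np

data = np.array([[1.0, 2.0, np.nan]])

# np.nanmean skips NaN entries; a column containing only NaN has nothing
# left to average, so NumPy emits a RuntimeWarning and returns NaN for it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    column_means = np.nanmean(data, axis=0)

# The first two means are well defined; the third is still NaN.
print(column_means)
```

So for the third column there is simply no value the mean strategy could fill in.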
To solve this problem I first check whether an entire column only contains NaN values, and if so, replace them with my default value 0:
# Loop over all columns in data
for column in data.T:
    # Check if all values in column are NaN
    if all(np.isnan(value) for value in column):
        # Fill the column with default value 0
        column.fill(0)
Is there a more elegant way to impute to a default value if an entire axis only contains NaN values?
Upvotes: 4
Views: 1538
Reputation: 12157
This is a vectorized version of what your for loop does, so it should be much faster:
default = 0
data[:, np.isnan(data).all(axis=0)] = default
You can then apply your Imputer().fit_transform() method to the new data.
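As a rough sketch of the two steps together: note that Imputer was removed from scikit-learn in later releases in favour of sklearn.impute.SimpleImputer, so the snippet below assumes a version that ships SimpleImputer. The sample array and the name imputed are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # assumption: scikit-learn >= 0.20

# Illustrative data: column 0 is entirely NaN, column 1 is partly NaN.
data = np.array([[np.nan, 1.0, 4.0],
                 [np.nan, np.nan, 5.0]])

# Step 1: fill columns that contain only NaN with the default value.
default = 0
data[:, np.isnan(data).all(axis=0)] = default

# Step 2: mean-impute the remaining NaN values column by column.
imputed = SimpleImputer(strategy="mean").fit_transform(data)
print(imputed)
```

Because no column is entirely NaN by the time the imputer sees the data, it has no reason to drop a column. Recent scikit-learn releases also document a keep_empty_features option on SimpleImputer, which may make the first step unnecessary; check the docs for your version.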
Example
data = np.array([[np.nan, 1, 1], [np.nan]*3, [1, 2, 3]]).T
which looks like
[[nan nan  1.]
 [ 1. nan  2.]
 [ 1. nan  3.]]
Applying our method to fill the all-nan column
default = 0
data[:, np.isnan(data).all(axis=0)] = default
and we get
[[nan  0.  1.]
 [ 1.  0.  2.]
 [ 1.  0.  3.]]
Upvotes: 3