Thijs van Ede

Reputation: 917

How to impute NaN values to a default value if strategy fails?

Problem

I am using the sklearn.preprocessing.Imputer class to impute NaN values using a mean strategy over the columns, i.e. axis=0. My problem is that some of the data to be imputed has only NaN values in its column, e.g. when there is only a single entry.

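import numpy as np
from sklearn.preprocessing import Imputer

# A single row, so the last column contains only NaN
data = np.array([[1, 2, np.nan]])
# The default strategy is the column mean (axis=0)
data = Imputer().fit_transform(data)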

This gives an output of array([[1., 2.]]), i.e. the all-NaN column is dropped.

Fair enough: the Imputer obviously cannot compute a mean for a column whose values are all NaN. However, instead of having the column removed, I would like to fall back to a default value, in my case 0.

Current approach

To solve this problem I first check whether an entire column only contains NaN values, and if so, replace them with my default value 0:

# Loop over all columns in data
for column in data.T:
    # Check if all values in column are NaN
    if all(np.isnan(value) for value in column):
        # Fill the column with default value 0
        column.fill(0)

Question

Is there a more elegant way to impute to a default value if an entire axis only contains NaN values?

Upvotes: 4

Views: 1538

Answers (1)

FHTMitchell

Reputation: 12157

This is a vectorized version of what your for loop does, so it should be much faster on large arrays:

default = 0
data[:, np.isnan(data).all(axis=0)] = default

You can then apply your Imputer().fit_transform() method to the new data.
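As a rough sketch of the two steps combined, the function below fills all-NaN columns with a default and then mean-imputes the remaining NaNs. The name impute_with_default is hypothetical, and np.nanmean is used here only as a NumPy stand-in for the Imputer's mean strategy, so treat this as an illustration rather than a drop-in replacement for the sklearn class.

```python
import numpy as np


def impute_with_default(data, default=0.0):
    """Mean-impute NaNs per column; all-NaN columns are set to `default`.

    Hypothetical helper: np.nanmean stands in for the Imputer's mean strategy.
    """
    # Work on a float copy so the caller's array is left untouched
    data = np.array(data, dtype=float)

    # Step 1: columns consisting only of NaN fall back to the default value
    all_nan_columns = np.isnan(data).all(axis=0)
    data[:, all_nan_columns] = default

    # Step 2: replace each remaining NaN with the mean of its column
    column_means = np.nanmean(data, axis=0)
    rows, cols = np.nonzero(np.isnan(data))
    data[rows, cols] = column_means[cols]
    return data


example = np.array([[np.nan, np.nan, 1.0],
                    [1.0, np.nan, 2.0],
                    [1.0, np.nan, 3.0]])
result = impute_with_default(example)
print(result)
```

Filling the all-NaN columns first means np.nanmean never sees an empty column, so no mean has to be computed over zero valid values.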


Example

data = np.array([[np.nan, 1, 1], [np.nan]*3, [1, 2, 3]]).T

which looks like

[[nan nan  1.]
 [ 1. nan  2.]
 [ 1. nan  3.]]

Applying our method to fill the all-NaN column (the other NaN is left for the Imputer)

default = 0
data[:, np.isnan(data).all(axis=0)] = default

and we get

[[nan  0.  1.]
 [ 1.  0.  2.]
 [ 1.  0.  3.]]

Upvotes: 3
