ykjk

Reputation: 85

Effect of scaling Features when NaNs are set to -1

I have a dataset containing some features with quite a lot of NaNs (up to 80%). Removing them would skew my overall distribution, so my options are to set all NaNs to -1/-99 or to bin the continuous variables into groups, turning them into categorical features.

As I already have many categorical features, I'd rather not make the few continuous ones categorical too. However, if I set the NaNs to -1/-99, will that significantly affect the results when I scale those features?

Or, from a different perspective, is there a way of scaling features without the -1 values affecting the scaling too much?

Upvotes: 3

Views: 813

Answers (1)

TayTay

Reputation: 7170

I know that you got the answer from the comments above, but in an effort to show new scikit-learn users how you might approach a problem like this, I've put together a very rudimentary solution that demonstrates how to build a custom transformer that would handle this:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
import numpy as np

class NanImputeScaler(BaseEstimator, TransformerMixin):
    """Scale an array with missing values, then impute them
    with a dummy value. This prevents the imputed value from impacting
    the mean/standard deviation computation during scaling.

    Parameters
    ----------
    with_mean : bool, optional (default=True)
        Whether to center the variables.

    with_std : bool, optional (default=True)
        Whether to divide by the standard deviation.

    nan_level : int or float, optional (default=-99.)
        The value to impute over NaN values after scaling the other features.
    """
    def __init__(self, with_mean=True, with_std=True, nan_level=-99.):
        self.with_mean = with_mean
        self.with_std = with_std
        self.nan_level = nan_level

    def fit(self, X, y=None):
        # Check the input array, but don't force everything to be finite.
        # This also ensures the array is 2D
        X = check_array(X, force_all_finite=False, ensure_2d=True)

        # compute the statistics on the data irrespective of NaN values
        self.means_ = np.nanmean(X, axis=0)
        self.std_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        # Check that we have already fit this transformer
        check_is_fitted(self, "means_")

        # validate again and request a copy so we can modify X in place
        # without touching the caller's array
        X = check_array(X, force_all_finite=False, ensure_2d=True, copy=True)

        # center if needed
        if self.with_mean:
            X -= self.means_
        # scale if needed
        if self.with_std:
            X /= self.std_

        # now fill in the missing values
        X[np.isnan(X)] = self.nan_level
        return X

The way this works is by computing the nanmean and nanstd in fit, so that the NaN values are ignored while computing the statistics. Then, in transform, after the variables are centered and scaled, the remaining NaN values are imputed with the value you designate (you mentioned -99, so that's what I defaulted to). You could always break that imputation step out into a separate transformer, but I included it here for demonstration purposes.
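If it isn't obvious why the nan-aware functions are needed, here is a minimal illustration (a throwaway array, not the example data below): the plain NumPy reductions propagate NaN, while nanmean/nanstd simply skip the missing entries.

import numpy as np

col = np.array([1., 2., np.nan, 4.])

np.mean(col)     # nan -- one missing value poisons the statistic
np.nanmean(col)  # 2.333... -- computed over the observed values only
np.std(col)      # nan
np.nanstd(col)   # 1.247...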

Example in action:

Here we'll set up some data with NaNs present:

nan = np.nan
data = np.array([
    [ 1., nan,  3.],
    [ 2.,  3., nan],
    [nan,  4.,  5.],
    [ 4.,  5.,  6.]
])

And when we fit the scaler and examine the means/standard deviations, you can see that they were computed with the NaN values ignored:

>>> imputer = NanImputeScaler().fit(data)
>>> imputer.means_
array([ 2.33333333,  4.        ,  4.66666667])
>>> imputer.std_
array([ 1.24721913,  0.81649658,  1.24721913])

Finally, when we transform the data, the data is scaled and the NaN values are handled:

>>> imputer.transform(data)
array([[ -1.06904497, -99.        ,  -1.33630621],
       [ -0.26726124,  -1.22474487, -99.        ],
       [-99.        ,   0.        ,   0.26726124],
       [  1.33630621,   1.22474487,   1.06904497]])
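As a quick sanity check, you can reproduce the first entry by hand from the fitted statistics; this just re-does the arithmetic the transformer performs:

>>> (1. - imputer.means_[0]) / imputer.std_[0]
-1.0690449676496976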

Pipelining

You can even use this pattern inside of a scikit-learn pipeline (and even persist it to disk):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# y was not defined above; any labels of length 4 will do for this toy example
y = np.array([0, 1, 0, 1])

pipe = Pipeline([
        ("scale", NanImputeScaler()),
        ("clf", LogisticRegression())
    ]).fit(data, y)
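And because the transformer follows the standard estimator API, persisting the fitted pipeline works the same way as for any other scikit-learn model, for example with joblib (the file name here is just a placeholder):

import joblib

joblib.dump(pipe, "nan_scaler_pipeline.pkl")          # save the fitted pipeline
pipe_loaded = joblib.load("nan_scaler_pipeline.pkl")  # reload it later
pipe_loaded.predict(data)                             # handles NaN-containing input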

Upvotes: 1
