Reputation: 85
I have a dataset containing some features with quite a lot of NaNs (up to 80%). Removing them would skew my overall distribution, so my options are either to set all NaNs to -1/-99 or to bin my continuous variables into groups, turning them into categorical features.
As I already have many categorical features, I'd rather not make the few continuous ones categorical too. However, if I set NaNs to -1/-99, will that significantly affect the results when I scale those features?
Or, from a different perspective, is there a way of scaling features without the -1 placeholder affecting the scaling too much?
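For example, here is a rough sketch of what worries me, using scikit-learn's StandardScaler on a toy column (the numbers are made up, just to illustrate the concern):
import numpy as np
from sklearn.preprocessing import StandardScaler

# one missing value replaced by a -99 sentinel before scaling
col = np.array([[1.0], [2.0], [3.0], [-99.0]])

scaler = StandardScaler().fit(col)
print(scaler.mean_)   # [-23.25]  -- dragged far below the real values
print(scaler.scale_)  # [~43.7]   -- hugely inflated by the sentinel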
Upvotes: 3
Views: 813
Reputation: 7170
I know you already got your answer from the comments above, but in an effort to show new scikit-learn users how one might approach a problem like this, I've put together a very rudimentary custom transformer that handles it:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
import numpy as np


class NanImputeScaler(BaseEstimator, TransformerMixin):
    """Scale an array with missing values, then impute them
    with a dummy value. This prevents the imputed value from impacting
    the mean/standard deviation computation during scaling.

    Parameters
    ----------
    with_mean : bool, optional (default=True)
        Whether to center the variables.

    with_std : bool, optional (default=True)
        Whether to divide by the standard deviation.

    nan_level : int or float, optional (default=-99.)
        The value to impute over NaN values after scaling the other features.
    """
    def __init__(self, with_mean=True, with_std=True, nan_level=-99.):
        self.with_mean = with_mean
        self.with_std = with_std
        self.nan_level = nan_level

    def fit(self, X, y=None):
        # Check the input array, but don't force everything to be finite.
        # This also ensures the array is 2D.
        X = check_array(X, force_all_finite=False, ensure_2d=True)

        # Compute the statistics on the data, ignoring any NaN values
        self.means_ = np.nanmean(X, axis=0)
        self.std_ = np.nanstd(X, axis=0)

        return self

    def transform(self, X):
        # Check that we have already fit this transformer
        check_is_fitted(self, "means_")

        # Validate again and take a copy, so the caller's array is not
        # modified in place
        X = check_array(X, force_all_finite=False, ensure_2d=True, copy=True)

        # Center if needed
        if self.with_mean:
            X -= self.means_
        # Scale if needed
        if self.with_std:
            X /= self.std_

        # Now fill in the missing values with the designated dummy level
        X[np.isnan(X)] = self.nan_level
        return X
The way this works is by computing the nanmean and nanstd in the fit method, so that the NaN values are ignored while computing the statistics. Then, in the transform method, after the variables are centered and scaled, the remaining NaN values are imputed with the value you designate (you mentioned -99, so that's what I defaulted to). You could always break that imputation step out into a separate transformer, but I included it here just for demonstration purposes.
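To see why the fitted statistics ignore the missing entries, here is a tiny sketch of the underlying NumPy behaviour (nothing here is specific to the class above):
import numpy as np

col = np.array([1.0, np.nan, 3.0])
print(np.mean(col))     # nan -- a plain mean is poisoned by the NaN
print(np.nanmean(col))  # 2.0 -- the NaN is simply ignored
print(np.nanstd(col))   # 1.0 -- likewise for the standard deviation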
Here we'll set up some data with NaNs present:
nan = np.nan

data = np.array([
    [ 1., nan,  3.],
    [ 2.,  3., nan],
    [nan,  4.,  5.],
    [ 4.,  5.,  6.]
])
And when we fit the scaler and examine the means/standard deviations, you can see that they were computed over the observed values only, so the NaNs did not skew them:
>>> imputer = NanImputeScaler().fit(data)
>>> imputer.means_
array([ 2.33333333, 4. , 4.66666667])
>>> imputer.std_
array([ 1.24721913, 0.81649658, 1.24721913])
Finally, when we transform the data, it is centered and scaled, and the NaN values are imputed with the designated level:
>>> imputer.transform(data)
array([[ -1.06904497, -99. , -1.33630621],
[ -0.26726124, -1.22474487, -99. ],
[-99. , 0. , 0.26726124],
[ 1.33630621, 1.22474487, 1.06904497]])
You can even use this pattern inside of a scikit-learn pipeline (and persist the fitted pipeline to disk):
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# y is assumed to be your target vector; dummy labels are used here
# so the snippet runs on the toy data above
y = np.array([0, 1, 0, 1])

pipe = Pipeline([
    ("scale", NanImputeScaler()),
    ("clf", LogisticRegression())
]).fit(data, y)
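For instance, persistence might look like this (a minimal sketch assuming the joblib package is installed; the filename is just an example):
import joblib

# serialize the fitted pipeline, then reload it and predict
joblib.dump(pipe, "nan_scale_pipeline.pkl")
restored = joblib.load("nan_scale_pipeline.pkl")
print(restored.predict(data))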
Upvotes: 1