Usama Waseem

Reputation: 35

How can I replace missing boolean values using python?

One of the columns in my dataset is a boolean, and it has missing values. The continuous columns also had missing values, which I successfully replaced with each column's mean. The mean does not make sense for a missing boolean, though, so how can I replace those values?

Note that the boolean is 1 or 0 in my dataset.

Below is the code for replacing continuous missing values:

import numpy as np
from sklearn.impute import SimpleImputer

# x holds the continuous columns; each NaN is replaced by its column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x)
x = imputer.transform(x)

Thank You

Upvotes: 3

Views: 1558

Answers (2)

Antoine Dubuis

Reputation: 5304

You can treat this boolean variable as a categorical feature and then use a SimpleImputer with the most_frequent strategy instead of mean.

You can do it as follows:

from sklearn.impute import SimpleImputer
import numpy as np

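# Create sample data with NaNs: one feature column with 100 rows.
# SimpleImputer works column by column, so the samples must be rows.
X = np.random.randint(2, size=100).reshape(-1, 1).astype(float)
X[::4, 0] = np.nan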

SimpleImputer(strategy="most_frequent").fit_transform(X)
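
If the boolean column sits in the same array as the continuous ones, one way to combine both strategies is a ColumnTransformer. This is only a minimal sketch: the data and the column positions (0 and 1 continuous, 2 boolean) are made up for illustration and would need to match your own dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical data: columns 0 and 1 are continuous, column 2 is a 0/1 flag.
x = np.array([
    [1.0, 10.0, 1.0],
    [2.0, np.nan, 1.0],
    [np.nan, 30.0, np.nan],
    [4.0, 40.0, 0.0],
    [5.0, 50.0, 1.0],
])

# Mean for the continuous columns, most frequent value for the boolean one.
imputer = ColumnTransformer([
    ("continuous", SimpleImputer(strategy="mean"), [0, 1]),
    ("boolean", SimpleImputer(strategy="most_frequent"), [2]),
])
x_imputed = imputer.fit_transform(x)
```

Note that the output columns follow the order of the transformers in the list, which here happens to match the original column order.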

Upvotes: 1

Michael Fleicher Tal

Reputation: 35

There are several ways to approach this issue.

  1. If you can afford it (that is, you have enough data), exclude the rows with missing values.
  2. Replace the missing values with the majority value (the boolean analogue of replacing a continuous value with the mean).
  3. For time series, replace the missing cell with the mean of the x cells before and after it, then apply a threshold: if the mean is above the threshold the value becomes 1, otherwise it becomes 0.
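
The three options above could be sketched with pandas roughly as follows. The series, the window size and the threshold are all hypothetical, and the third option assumes that a neighbourhood mean above the threshold maps to 1 and anything else maps to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 column with two missing entries (positions 2 and 5).
s = pd.Series([1, 1, np.nan, 1, 0, np.nan, 0, 0, 1, 1], dtype=float)

# 1. Exclude the rows where the flag is missing.
dropped = s.dropna()

# 2. Replace missing entries with the majority (most frequent) value.
majority = s.fillna(s.mode().iloc[0])

# 3. Time series: average up to `window` neighbours on each side
#    (NaNs are skipped), then threshold the average to get 0 or 1.
window = 2        # assumed number of cells before and after
threshold = 0.5   # assumed cut-off
local_mean = s.rolling(2 * window + 1, center=True, min_periods=1).mean()
neighbour_fill = s.fillna((local_mean > threshold).astype(float))
```

Option 3 only makes sense when the rows are ordered in time, since it relies on neighbouring rows being related to each other.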

Upvotes: 1
