Reputation: 77
I have a simple classification problem that I am trying to address with a neural network in Keras. The dataset is numeric, of size 26000 * 17, but it contains a lot of missing (null) values. The data is quite sensitive, so I can neither drop all rows containing null values nor replace the nulls with the mean, the average, or any other standard number. There is also a constraint that KNN imputation cannot be used to fill the missing entries. What is the best way to handle such a dataset?
Upvotes: 4
Views: 822
Reputation: 1
# Handling missing numerical data: fill each numeric column with its mean.
# X is assumed to be a pandas DataFrame of features.
import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

# Handling missing string data: fill each string column with its most frequent value.
string_cols = list(np.where(X.dtypes == object)[0])
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
Upvotes: 0
Reputation: 101
The best way to replace missing values in any sort of numeric dataset is KNN imputation, which fills each missing value by looking at the neighbouring entries.
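For reference, scikit-learn ships a KNNImputer that does exactly this. Below is a minimal sketch, assuming X is a pandas DataFrame whose missing entries are marked as np.nan; the tiny example frame and n_neighbors=2 are purely illustrative choices.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Tiny illustrative frame with gaps marked as np.nan
X = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                  "b": [10.0, np.nan, 30.0, 40.0]})

# Each missing cell is filled from the 2 most similar rows (nan-aware Euclidean distance)
imputer = KNNImputer(n_neighbors=2)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
print(X_imputed)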
Upvotes: 0
Reputation: 79
I don't know how crucial your data is. By the way, there is no universally good way to handle missing values; you will have to fill them with the mean, the average, or some standard number (e.g. 0). KNN imputation is considered the best method, but I don't know why there is a constraint against using it.
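If filling with a fixed number is acceptable, SimpleImputer with strategy='constant' does that directly. A minimal sketch, where df is a hypothetical pandas DataFrame with np.nan gaps and 0 is the chosen fill value:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with missing entries
df = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                   "y": [np.nan, 5.0, 6.0]})

# Replace every np.nan with the constant 0
imp_const = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
df_filled = pd.DataFrame(imp_const.fit_transform(df), columns=df.columns)
print(df_filled)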
Upvotes: 1