Reputation: 225
To learn more, I've been trying out imputation methods other than mean/median/mode. So far I have tried KNN, MICE, and median imputation. I was told that imputation by clustering is also possible, but my internet search for a package that does it turned up only research papers.
I'm running these imputation methods on the Iris dataset after deliberately creating missing values in it (since Iris has none). My approach for the other methods is as follows:
import random
import numpy as np
import pandas as pd
data = pd.read_csv("D:/Iris_classification/train.csv")
#Shuffle the data and reset the index
from sklearn.utils import shuffle
data = shuffle(data).reset_index(drop = True)
#Create Independent and dependent matrices
X = data.iloc[:, [0, 1, 2, 3]].values
y = data.iloc[:, 4].values
from sklearn.model_selection import train_test_split #sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 50, random_state = 0)
#Standardize the data
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
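#Introduce missing values at random
prop = int(X_train.size * 0.5) #Number of entries to blank out (up to 50%; repeated draws can hit the same cell)
prop1 = int(X_test.size * 0.5)
a = [random.choice(range(X_train.shape[0])) for _ in range(prop)] #Randomly chosen row indices
b = [random.choice(range(X_train.shape[1])) for _ in range(prop)] #Randomly chosen column indices
c = [random.choice(range(X_test.shape[0])) for _ in range(prop1)]
d = [random.choice(range(X_test.shape[1])) for _ in range(prop1)]
X1_train = X_train.copy() #Work on copies so the complete data stays available
X1_test = X_test.copy()
X1_train[a, b] = np.nan
X1_test[c, d] = np.nan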
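And then for KNN imputation, I've done:
from fancyimpute import KNN
X_train_filled = KNN(3).complete(X1_train) #Newer fancyimpute releases use fit_transform() in place of complete()
X_test_filled = KNN(3).complete(X1_test)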
Is there a way to impute missing values by clustering? Also, StandardScaler() doesn't work when the data contains NaN values. Are there any other ways to standardize the data?
Upvotes: 6
Views: 3271
Reputation: 970
Have you looked at the fancyimpute package? It offers KNN, MICE, Matrix Factorization, and a few others.
There is also impyute, which I haven't personally used, but a presenter at SciPy told me he used it when fancyimpute wouldn't compile. It appears to have much better documentation than fancyimpute, although fewer options.
Other than that, there are not many great imputation libraries in Python. This is one area where R really shines over Python, with excellent imputation packages like Amelia and mice.
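Since no ready-made package for cluster-based imputation came up, one option is to hand-roll it. The following is only a rough sketch of one common variant (k-means-style imputation): start from a column-mean fill, cluster the rows, then refill the missing entries from each row's cluster centroid and repeat. The function name kmeans_impute and its parameters are made up for illustration, and it assumes every column has at least one observed value.

```python
import numpy as np

def kmeans_impute(X, n_clusters=3, n_iter=10, seed=0):
    """Fill NaNs with the matching coordinate of each row's cluster centroid."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from a column-mean fill so every row can be assigned to a cluster.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen rows (fancy indexing copies them).
    centroids = filled[rng.choice(len(filled), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each row to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(filled[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster ends up empty.
        for k in range(n_clusters):
            members = filled[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
        # Overwrite only the originally missing entries with centroid values.
        filled[missing] = centroids[labels][missing]
    return filled

# Toy data with two obvious groups and two missing entries.
X_demo = np.array([[1.0, 2.0], [1.2, np.nan], [8.0, 9.0], [np.nan, 9.5]])
X_demo_filled = kmeans_impute(X_demo, n_clusters=2)
print(X_demo_filled)
```

Observed values are never changed; only the NaN positions are overwritten. The result depends on the random initialisation and on the number of clusters, so treat it as a starting point rather than a tuned method.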
Upvotes: 0
Reputation: 48437
The main problem to deal with here is how to handle missing data.
First of all, I should tell you that removing the "problem" rows can be quite dangerous, because they may contain crucial information.
Is there a way to impute missing values by clustering?
Yes, in the simplest case you can replace the missing data with the mean of all the values in the column. (Strictly speaking, this is mean imputation rather than clustering.)
You can do this using the Imputer class from sklearn.preprocessing:
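from sklearn.preprocessing import Imputer #Removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X) #Learn the column means from the observed values
X = imputer.transform(X) #Replace each NaN with its column mean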
You have to apply this step right after "Create Independent and dependent matrices", before scaling and any other preprocessing.
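If scaling has to happen while the NaNs are still in place (for example, before a distance-based imputer such as KNN), one workaround is to compute the statistics with NumPy's NaN-aware functions. This is a minimal sketch, not part of any library; the helper name nan_standardize is hypothetical.

```python
import numpy as np

def nan_standardize(X_train, X_test):
    """Scale columns to zero mean and unit variance, skipping NaNs in the statistics."""
    mean = np.nanmean(X_train, axis=0)   # per-column mean of the observed values
    std = np.nanstd(X_train, axis=0)     # population std (ddof=0), as StandardScaler uses
    std[std == 0] = 1.0                  # avoid division by zero for constant columns
    # NaNs stay NaN after scaling; the test set reuses the training statistics.
    return (X_train - mean) / std, (X_test - mean) / std

train_demo = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 8.0]])
test_demo = np.array([[3.0, 6.0]])
train_scaled, test_scaled = nan_standardize(train_demo, test_demo)
print(train_scaled)
print(test_scaled)
```

The missing entries are left as NaN, so an imputer can then be run on the scaled data.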
Here is a simple example to show how it works:
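A minimal sketch of column-mean imputation, written in plain NumPy so that it does not depend on a particular scikit-learn version. The small array is invented for illustration and is not taken from the Iris file.

```python
import numpy as np

# Toy matrix with one missing value in each of the last two columns.
X_toy = np.array([[5.1, 3.5, np.nan],
                  [4.9, np.nan, 1.4],
                  [6.2, 2.9, 4.3]])

col_means = np.nanmean(X_toy, axis=0)                  # per-column mean, skipping NaNs
X_toy_filled = np.where(np.isnan(X_toy), col_means, X_toy)  # fill NaNs, keep observed values
print(X_toy_filled)
```

Each NaN is replaced by the mean of the observed values in its own column, which is the same replacement that the 'mean' strategy performs.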
Upvotes: 1