Reputation: 549
I wanted to compare the values imputed by MICE, KNN, and SoftImpute from the fancyimpute package. However, when I ran my code, KNN and SoftImpute imputed only 0 for the missing values, whereas MICE imputed more sensible values.
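from fancyimpute import MICE, KNN, SoftImpute

# Numeric columns of the selection (here only 'Age') as a NumPy array
imputed_numerical = train[['Age']].select_dtypes(include=['number']).as_matrix()
Age_MICE = MICE().complete(imputed_numerical)
Age_KNN = KNN(k=3).complete(imputed_numerical)
Age_SoftImpute = SoftImpute().complete(imputed_numerical)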
I put the results in a dataframe which looks like this:
Not_Imputed MICE KNN SoftImpute
22.0 [22.0] [22.0] [22.0]
38.0 [38.0] [38.0] [38.0]
26.0 [26.0] [26.0] [26.0]
35.0 [35.0] [35.0] [35.0]
35.0 [35.0] [35.0] [35.0]
NaN [29] [0.0] [0.0]
54.0 [54.0] [54.0] [54.0]
2.0 [2.0] [2.0] [2.0]
27.0 [27.0] [27.0] [27.0]
14.0 [14.0] [14.0] [14.0]
4.0 [4.0] [4.0] [4.0]
58.0 [58.0] [58.0] [58.0]
20.0 [20.0] [20.0] [20.0]
39.0 [39.0] [39.0] [39.0]
14.0 [14.0] [14.0] [14.0]
55.0 [55.0] [55.0] [55.0]
2.0 [2.0] [2.0] [2.0]
NaN [27.6] [0.0] [0.0]
31.0 [31.0] [31.0] [31.0]
NaN [30] [0.0] [0.0]
Question: Why are KNN and SoftImpute only imputing 0 as the completed value?
Upvotes: 1
Views: 1075
Reputation: 970
The problem is that these are multivariate procedures, but you are using only one variable (column). MICE regresses each variable on the others, and KNN averages the k nearest neighbors of the row with the missing value, where distance is measured in a multidimensional space (each dimension is a variable). With a single column, a row whose only value is missing has no observed values left from which to measure distance, so no neighbors can be found. I'm not sure about SoftImpute, but it is likely a multivariate procedure as well.
For example, see this warning message from the KNN procedure:
[KNN] Warning: 3/20 still missing after imputation, replacing with 0
or this warning from SoftImpute:
RuntimeWarning: invalid value encountered in double_scalars
return (np.sqrt(ssd) / old_norm) < self.convergence_threshold
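To make the single-column failure concrete, here is a minimal pure-Python sketch of KNN imputation. It is not fancyimpute's implementation; the function name `knn_impute` and its `fallback` parameter are hypothetical, chosen only to mirror the "replacing with 0" behavior in the warning above. It assumes distances are computed over the columns that both rows have observed.

```python
import math


def knn_impute(rows, k=3, fallback=0.0):
    """Fill NaNs with the mean of the k nearest rows (illustrative sketch only).

    The distance between two rows uses only the columns observed in both.
    If they share no observed column, the distance is infinite and the
    other row cannot serve as a neighbor.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if not math.isnan(x) and not math.isnan(y)]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: other rows that have column j observed.
            candidates = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and not math.isnan(other[j])
            )
            nearest = [v for d, v in candidates[:k] if math.isfinite(d)]
            # With no usable neighbor, fall back to a constant (hypothetical default).
            completed[i][j] = sum(nearest) / len(nearest) if nearest else fallback
    return completed


nan = math.nan

# One column: the row with the missing value shares nothing with any other row.
one_column = [[22.0], [38.0], [nan], [26.0]]
print(knn_impute(one_column))

# Two columns: the second column lets the algorithm find neighbors.
two_columns = [[22.0, 1.0], [38.0, 3.0], [nan, 1.0], [26.0, 1.0]]
print(knn_impute(two_columns, k=2))
```

In the one-column case every distance is undefined, so the missing entry receives the fallback value. Once a second, observed column is supplied, the two rows closest on that column donate their values and the imputed entry is their mean.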
Upvotes: 0