Reputation: 3
I am using scikit-learn to impute missing values in my data set, but the largest values of one feature after imputation show that the missing values are being filled in incorrectly. First, I use pandas to look at the 10 largest values of that feature in my data set:
ofData = mergeData.iloc[:, 3]
print ofData.nlargest(10)
The output of this is,
124 4.0
128 4.0
146 4.0
147 4.0
177 4.0
240 4.0
253 4.0
310 4.0
360 4.0
361 4.0
This is correct: I know 4.0 to be the maximum possible value for this feature. Then I impute the data with scikit-learn:
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
nData = imp.fit_transform(mergeData)
nData = pd.DataFrame(nData)
Then I once again use pandas to see the largest 10 values for this feature.
ofData = nData.iloc[:, 3]
print ofData.nlargest(10)
Which outputs,
1030 77.571129
1056 67.804684
1308 62.780544
1212 61.902375
927 61.207525
870 60.592999
1100 55.604145
1722 55.308159
1415 52.637559
72 49.940297
These values are clearly not the mean of that feature, since they are all larger than the maximum value from before imputation. I'm completely lost as to what could be causing this, and I'm worried it could be affecting the imputation of other features in my data set as well.
Upvotes: 0
Views: 55
Reputation: 57033
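Since you want to replace the missing values in a column with that column's mean, axis must be 0 (which is the default value), not 1. With axis=1, your code replaces each missing value with the mean of its row, which mixes in values from the other features. That is why the imputed values can be larger than this feature's maximum of 4.0.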
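To see the difference between the two axis settings, here is a minimal sketch that uses plain NumPy as a stand-in for the imputer. The toy array `X` and the helper `mean_impute` are made up for illustration and are not part of scikit-learn; the helper only mimics filling NaNs with a per-column mean (axis=0) or a per-row mean (axis=1).

```python
import numpy as np

# Hypothetical toy data: column 0 is a small-scale feature (max 4.0),
# column 1 is a feature on a much larger scale.
X = np.array([
    [4.0, 100.0],
    [2.0, 200.0],
    [np.nan, 300.0],
])

def mean_impute(data, axis):
    """Fill NaNs with the mean taken along `axis` (0 = per column, 1 = per row)."""
    # keepdims=True keeps the means broadcastable against the original shape
    means = np.nanmean(data, axis=axis, keepdims=True)
    # Use the mean only where the original value is missing
    return np.where(np.isnan(data), means, data)

by_column = mean_impute(X, axis=0)  # what you want: mean of the feature
by_row = mean_impute(X, axis=1)     # what axis=1 does: mean of the sample's row

print(by_column[2, 0])  # 3.0, the mean of 4.0 and 2.0
print(by_row[2, 0])     # 300.0, pulled from the other feature in that row
```

With the per-row mean, the filled value takes on the scale of the other features, which matches the oversized values in your output. Note also that in newer scikit-learn releases `Imputer` has been replaced by `SimpleImputer` in `sklearn.impute`, which has no axis parameter and always imputes per column.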
Upvotes: 1