confusedandconfused

Reputation: 3

Scikit-learn imputing values incorrectly

I am using scikit-learn to impute missing values in my data set, but the largest values of one feature after imputation make it clear that the missing values are being imputed incorrectly. First, I use pandas to look at the 10 largest values of that feature in my data set:

 ofData = mergeData.iloc[:, 3]
 print ofData.nlargest(10)

The output is:

 124    4.0
 128    4.0
 146    4.0
 147    4.0
 177    4.0
 240    4.0
 253    4.0
 310    4.0
 360    4.0
 361    4.0

This is correct: I know 4.0 is the maximum possible value for this feature. Then I impute the data with scikit-learn:

 imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
 nData = imp.fit_transform(mergeData)
 nData = pd.DataFrame(nData)

Then I once again use pandas to see the largest 10 values for this feature.

 ofData = nData.iloc[:, 3]
 print ofData.nlargest(10)

This outputs:

 1030    77.571129
 1056    67.804684
 1308    62.780544
 1212    61.902375
 927     61.207525
 870     60.592999
 1100    55.604145
 1722    55.308159
 1415    52.637559
 72      49.940297

These values are clearly not the mean of that feature, since they are all larger than the maximum value before imputation. I'm completely lost as to what could be causing this, and I'm worried it could be affecting the imputation of other features in my data set as well.

Upvotes: 0

Views: 55

Answers (1)

DYZ

Reputation: 57033

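Since you want to replace missing values in a column with the mean of that column, the axis must be 0 (which is the default value), not 1. With axis=1, your code replaces each missing value with the mean of its row, so features on a larger scale pull the imputed value above this feature's maximum of 4.0.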
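To make the difference concrete, here is a minimal sketch that uses only pandas rather than the `Imputer` class, so it does not depend on a particular scikit-learn version. The toy frame and its column names `a` and `b` are made up for illustration and are not the asker's `mergeData`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): column "a" is on a small scale,
# column "b" is on a much larger one.
df = pd.DataFrame({
    "a": [1.0, np.nan, 4.0],
    "b": [100.0, 200.0, 300.0],
})

# Column-wise mean imputation (the axis=0 behaviour): each NaN is
# replaced by the mean of the non-missing values in its own column.
by_column = df.fillna(df.mean())

# Row-wise mean imputation (the axis=1 behaviour): each NaN is
# replaced by the mean of the non-missing values in its own row.
by_row = df.T.fillna(df.mean(axis=1)).T

print(by_column["a"].tolist())  # [1.0, 2.5, 4.0]
print(by_row["a"].tolist())     # [1.0, 200.0, 4.0]
```

The row-wise result puts 200.0 into a column whose observed maximum is 4.0, which is the same symptom as in the question. Note also that `Imputer` was removed in later scikit-learn releases; its replacement, `sklearn.impute.SimpleImputer`, has no `axis` argument and always imputes per column.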

Upvotes: 1
