Reputation: 470
I have the data below. Note that the Age column contains a NaN. My goal is to impute all columns properly.
+----+-------------+----------+--------+------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
+----+-------------+----------+--------+------+-------+-------+---------+
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 |
| 5 | 6 | 0 | 3 | NaN | 0 | 0 | 8.4583 |
+----+-------------+----------+--------+------+-------+-------+---------+
I have working code that imputes all columns, but the result below looks wrong.
+----+-------------+----------+--------+-----------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
+----+-------------+----------+--------+-----------+-------+-------+---------+
| 0 | 1.0 | 0.0 | 3.0 | 22.000000 | 1.0 | 0.0 | 7.2500 |
| 1 | 2.0 | 1.0 | 1.0 | 38.000000 | 1.0 | 0.0 | 71.2833 |
| 2 | 3.0 | 1.0 | 3.0 | 26.000000 | 0.0 | 0.0 | 7.9250 |
| 3 | 4.0 | 1.0 | 1.0 | 35.000000 | 1.0 | 0.0 | 53.1000 |
| 4 | 5.0 | 0.0 | 3.0 | 35.000000 | 0.0 | 0.0 | 8.0500 |
| 5 | 6.0 | 0.0 | 3.0 | 2.909717 | 0.0 | 0.0 | 8.4583 |
+----+-------------+----------+--------+-----------+-------+-------+---------+
My code is below:
import pandas as pd
import numpy as np
#https://www.kaggle.com/shivamp629/traincsv/downloads/traincsv.zip/1
data = pd.read_csv("train.csv")
data2 = data[['PassengerId', 'Survived','Pclass','Age','SibSp','Parch','Fare']].copy()
from sklearn.preprocessing import Imputer
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
data2_im = pd.DataFrame(fill_NaN.fit_transform(data2), columns = data2.columns)
data2_im
It's weird that the imputed age is 2.909717. Is there a proper way to do simple mean imputation? I am fine doing it column by column, but I am not clear on the syntax or approach. Thanks for any help.
Upvotes: 1
Views: 116
Reputation: 25189
The root of your problem is this line:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
With axis=1 the imputer averages across each row, so the missing Age is filled with the mean of unrelated columns in that row (apples and oranges).
Try changing it to:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0) # axis=0
and you will have the expected behaviour.
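To see where 2.909717 comes from, the arithmetic can be reproduced by hand. This is only a sketch in plain Python using the six rows shown in the question (the full train.csv has more rows, so its column mean will differ); the variable names are made up for illustration.

```python
# Row 5 from the question's table; None stands in for the missing Age.
row = {"PassengerId": 6, "Survived": 0, "Pclass": 3, "Age": None,
       "SibSp": 0, "Parch": 0, "Fare": 8.4583}

# axis=1: mean of the other values in the same row.
present = [v for v in row.values() if v is not None]
row_mean = sum(present) / len(present)

# axis=0: mean of the observed Age values (six-row sample only).
ages = [22.0, 38.0, 26.0, 35.0, 35.0]
column_mean = sum(ages) / len(ages)

print(round(row_mean, 6))  # 2.909717, the value the question found suspicious
print(column_mean)         # 31.2
```

The row mean matches the odd value in the question's output, which confirms that the imputer was averaging PassengerId, Pclass, Fare and so on instead of the other ages.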
strategy='median' could be even better, as the median is robust against outliers:
fill_NaN = Imputer(missing_values=np.nan, strategy='median', axis=0)
Upvotes: 1
Reputation: 67
Try either
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0)
or
data2.fillna(data2.mean())
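For the pandas route, note that fillna returns a new DataFrame rather than modifying data2, so the result needs to be assigned. A minimal sketch, assuming a small hand-typed frame with three of the question's columns (the name data2_filled is made up here):

```python
import numpy as np
import pandas as pd

# Six-row sample from the question, reduced to three columns for brevity.
data2 = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# data2.mean() yields one mean per column; fillna matches them by column name.
data2_filled = data2.fillna(data2.mean())

# Column-by-column alternative: impute only Age, in the original frame.
data2["Age"] = data2["Age"].fillna(data2["Age"].mean())

print(data2_filled.loc[5, "Age"])
```

Both forms fill the missing Age with the mean of the observed ages in this sample.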
Upvotes: 1
Reputation: 765
The problem is that you use the wrong axis. The correct code should be:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0)
Note the axis=0.
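On newer scikit-learn releases the Imputer class is no longer available (it was deprecated in 0.20 and removed in 0.22). Its replacement, sklearn.impute.SimpleImputer, has no axis argument and always imputes column by column, so the mistake above cannot occur. A sketch under the assumption that such a version is installed, again using a hand-typed sample instead of train.csv:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Six-row sample from the question, reduced to three columns for brevity.
data2 = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# SimpleImputer computes one statistic per column; use strategy="median" if preferred.
fill_NaN = SimpleImputer(missing_values=np.nan, strategy="mean")
data2_im = pd.DataFrame(fill_NaN.fit_transform(data2), columns=data2.columns)

print(data2_im.loc[5, "Age"])
```

As with the original code, fit_transform returns a float array, so integer columns come back as floats in data2_im.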
Upvotes: 1