Mitesh

Reputation: 53

Imputer on some columns in a Dataframe

I am trying to use Imputer on a single column called "Age" to replace missing values, but I get the error: "Expected 2D array, got 1D array instead:"

My code is as follows:

import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer

dataset = pd.read_csv("titanic_train.csv")

dataset.drop('Cabin', axis=1, inplace=True)
x = dataset.drop('Survived', axis=1)
y = dataset['Survived']

imputer = Imputer(missing_values="nan", strategy="mean", axis=1)
imputer = imputer.fit(x['Age'])
x['Age'] = imputer.transform(x['Age'])

Upvotes: 5

Views: 8442

Answers (3)

desertnaut

Reputation: 60319

Although @thesilkworm beat me to it, it may be useful to know exactly why your own code doesn't work.

Apart from the reshape issue, there are two more mistakes in your code. The first is that you ask for axis=1 in your imputer, when you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in @thesilkworm's answer); from the docs:

axis : integer, optional (default=0)

The axis along which to impute.

  • If axis=0, then impute along columns.
  • If axis=1, then impute along rows.
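
To make the difference between the two axes concrete, here is a minimal NumPy sketch. It is not the Imputer itself, and the small array is made up for illustration; it just fills each NaN with the mean taken along the chosen axis:

```python
import numpy as np

# Toy data (hypothetical): 3 rows, 2 columns, one missing value.
data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, 30.0]])

# axis=0: each NaN is replaced by the mean of its column.
col_means = np.nanmean(data, axis=0)
by_column = np.where(np.isnan(data), col_means, data)

# axis=1: each NaN is replaced by the mean of its row.
row_means = np.nanmean(data, axis=1, keepdims=True)
by_row = np.where(np.isnan(data), row_means, data)

print(by_column[1, 0])  # 2.0  (mean of 1.0 and 3.0 in the first column)
print(by_row[1, 0])     # 20.0 (mean of the only other value in that row)
```

For a single feature such as Age, the column-wise mean (axis=0) is the one you want; a row-wise mean would mix in values from other features, or have nothing to average at all when there is only one column.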

The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:

missing_values : integer or "NaN", optional (default="NaN")

The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.

As an alternative but equivalent solution (beyond the one already provided by @thesilkworm), you can also fit and transform in one line:

imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
x['Age'] = imp.fit_transform(x['Age'].values.reshape(-1, 1))
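
A side note for newer installations: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, where sklearn.impute.SimpleImputer takes its place (it has no axis argument and always imputes column-wise). Below is a minimal sketch of the same one-liner, assuming a recent scikit-learn and using a small made-up DataFrame as a stand-in for the Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the Titanic features.
x = pd.DataFrame({"Age": [22.0, np.nan, 26.0],
                  "Fare": [7.25, 71.28, 7.92]})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# Double brackets select a one-column DataFrame, which is already 2D,
# so the imputer accepts it as-is; ravel() flattens the (n, 1) result
# back to 1D before assigning it to the column.
x["Age"] = imp.fit_transform(x[["Age"]]).ravel()

print(x["Age"].tolist())  # [22.0, 24.0, 26.0]
```

Note that SimpleImputer takes np.nan directly for missing_values rather than the string "NaN".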

Upvotes: 3

Bhaskar

Reputation: 343

When fitting or transforming, use reshape(-1, 1), because the method expects a 2D array as input but you are passing a 1D array.

Ex: x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
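
To see what reshape(-1, 1) actually changes, here is a short NumPy sketch; the ages array is a made-up stand-in for x['Age'].values:

```python
import numpy as np

# 1D array, the same shape you get from a single DataFrame column's values.
ages = np.array([22.0, np.nan, 26.0])
print(ages.shape)     # (3,)

# -1 tells NumPy to infer the number of rows; 1 fixes a single column.
ages_2d = ages.reshape(-1, 1)
print(ages_2d.shape)  # (3, 1)
```

The values are untouched; only the shape goes from (n,) to (n, 1), which is the "n samples, 1 feature" layout the error message is asking for.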

Upvotes: 0

sjw

Reputation: 6543

The Imputer expects a 2-dimensional array as input, even if one of those dimensions is of length 1. This can be achieved using NumPy's reshape:

imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))

That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use Pandas fillna instead:

x['Age'] = x['Age'].fillna(x['Age'].mean())

Upvotes: 10
