OneHotEncoder raising NaN issue after SimpleImputer has been called already

Question

I'm having trouble understanding how pipelines are supposed to work in scikit-learn. Here is an example using the Titanic dataset.

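import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('datasets/train.csv')

cat_attribs = ["Embarked", "Cabin", "Ticket", "Name"]

# Fill missing numeric values with the column median
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
])

# Fill missing string values with the most frequent value
str_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
    ("str", str_pipeline, ["Cabin", "Sex"]),
    ("cat", OneHotEncoder(), ["Cabin"]),
])

full_pipeline.fit_transform(data)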

I'd expect this to fill all missing (NaN) values in both the numeric and the string attributes, and then finally transform the Cabin attribute into a numerical one.

Instead the code ends up with the following error:

ValueError: Input contains NaN

If I remove the line calling OneHotEncoder and print the transformed array, it contains no NaN values.

So how am I supposed to call OneHotEncoder in this situation?
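For reference, ColumnTransformer applies each of its transformers to the original input columns independently and concatenates the results, so the "cat" entry above receives the raw Cabin column rather than the imputed one. One way to get the behaviour described in the question is to chain the encoder after the imputer inside a single Pipeline. The following is a minimal sketch, not a definitive fix: the small DataFrame is a made-up stand-in for train.csv, and the name cat_pipeline, the handle_unknown="ignore" setting, and sparse_threshold=0 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic training data (not the real file).
data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Cabin": ["C85", np.nan, "C85", "E46"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
])

# Impute first, then encode, inside one pipeline, so the encoder
# receives the imputed column instead of the raw one.
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

full_pipeline = ColumnTransformer(
    [
        ("num", num_pipeline, ["Age", "Fare"]),
        ("cat", cat_pipeline, ["Cabin"]),
    ],
    sparse_threshold=0,  # assumption: force a dense array for easy inspection
)

transformed = full_pipeline.fit_transform(data)
print(transformed.shape)
print(np.isnan(transformed).any())
```

With this layout, each column is listed under exactly one ColumnTransformer entry, and the steps that must happen in sequence live inside that entry's Pipeline.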


Answers (1)
