Reputation: 352
I have trouble understanding how pipelines are supposed to work in scikit-learn. The following is an example using the Titanic dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('datasets/train.csv')
cat_attribs = ["Embarked", "Cabin", "Ticket", "Name"]
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
])
str_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
])
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
    ("str", str_pipeline, ["Cabin", "Sex"]),
    ("cat", OneHotEncoder(), ["Cabin"]),
])
full_pipeline.fit_transform(data)
I'd expect this to fill all missing (NaN) values in both the numeric and the string attributes, and then finally transform the Cabin attribute into a numerical one.
Instead, the code fails with the following error:
ValueError: Input contains NaN.
If I remove the line calling OneHotEncoder and print the transformed array, it contains no NaN values.
Hence I wonder: how am I supposed to call OneHotEncoder in this situation?
Upvotes: 3
Views: 2370
Reputation: 16966
I would recommend applying OneHotEncoder to all categorical variables, so make that a separate pipeline that imputes first and then encodes.
Since the numerical columns need only a single step, you can pass the SimpleImputer to the ColumnTransformer directly.
Try this!
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Categorical columns: impute missing values first, then one-hot encode the result.
cat_preprocess = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())

# ColumnTransformer takes a list of (name, transformer, columns) tuples;
# make_column_transformer expects unnamed (transformer, columns) pairs instead.
ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
    ("str", cat_preprocess, ["Cabin", "Sex"]),
])

pipeline = Pipeline([('preprocess', ct)])
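The reason this works is that ColumnTransformer hands each transformer the original columns and concatenates the outputs side by side; it does not feed one entry's output into the next. An encoder listed as its own entry therefore still sees the raw Cabin values, while an encoder placed after the imputer inside a single pipeline sees the filled ones. Here is a minimal sketch to check that on a tiny made-up frame (the `toy` data below is an assumption for illustration, not the real Titanic file, and `handle_unknown="ignore"` is an optional extra I added so unseen categories at predict time do not raise):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic data: one missing value per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Cabin": pd.Series(["C85", np.nan, "C85", "E46"], dtype=object),
})

# Impute, then encode, inside ONE pipeline so the encoder receives imputed values.
cat_preprocess = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age"]),
    ("cat", cat_preprocess, ["Cabin"]),
])

out = ct.fit_transform(toy)
# The result may be a sparse matrix depending on its density; densify for inspection.
if hasattr(out, "toarray"):
    out = out.toarray()

print(out.shape)
print(np.isnan(out).any())
```

The output has one column for Age plus one column per Cabin category that remains after imputation, and no NaN is left anywhere.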
Upvotes: 2