Reputation: 1054
I have been experimenting with Scikit Learn's Pipeline class and the Iris dataset. A short summary of each section of my code is as follows:
df:
      Id  slengthCm  sWidthCm  pLengthCm  PWidthCm         Species
0      1        5.1       3.5        1.4       0.2     Iris-setosa
1      2        4.9       3.0        1.4       0.2     Iris-setosa
2      3        4.7       3.2        1.3       0.2     Iris-setosa
3      4        4.6       3.1        1.5       0.2     Iris-setosa
4      5        5.0       3.6        1.4       0.2     Iris-setosa
...  ...        ...       ...        ...       ...             ...
145  146        6.7       3.0        5.2       2.3  Iris-virginica
146  147        6.3       2.5        5.0       1.9  Iris-virginica
147  148        6.5       3.0        5.2       2.0  Iris-virginica
dtypes:
Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object
pipeline elements:
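import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class Debug(BaseEstimator, TransformerMixin):
    """Pass-through step that prints and stores the data it receives."""
    def transform(self, X):
        print(pd.DataFrame(X).head())
        print(X.shape)
        self.X = X
        self.df = pd.DataFrame(self.X)
        return X
    def fit(self, X, y=None, **fit_params):
        return self

# Note: scikit-learn 1.2+ renamed OneHotEncoder's `sparse` argument to `sparse_output`.
pipeline = Pipeline(steps=[('one_hot_encoding', OneHotEncoder(sparse=False)),
                           ('debug_1', Debug()),
                           ('standard_scaler', StandardScaler(with_mean=False)),
                           ('debug_2', Debug()),
                           ('kmeans_clustering', KMeans())])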
Now, if I fit this pipeline and then view the contents of the second debug step:
pipeline.fit_transform(df.values)
pipeline.named_steps["debug_2"].df
It seems that the one_hot_encoding step has one-hot (0/1) encoded every column of the df instead of only the Species (object type) column.
Is there a way to make OHE inside a pipeline apply only to specified columns, or only to categorical/object ones?
Upvotes: 0
Views: 268
Reputation: 12602
You're looking for the ColumnTransformer, possibly with the helper make_column_selector for the specification of which columns to give to each transformer. For example,
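import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector

preproc = ColumnTransformer(
    transformers=[
        # numeric columns are scaled; note the parameter is with_mean, not withmean
        ('num', StandardScaler(with_mean=False), make_column_selector(dtype_include=np.number)),
        # object (string) columns are one-hot encoded
        ('obj', OneHotEncoder(), make_column_selector(dtype_include=object)),
    ],
)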
or, being more explicit about columns,
preproc = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(with_mean=False), ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]),
        ('obj', OneHotEncoder(), ["Species"]),
    ],
)
Then
pipeline = Pipeline(steps=[('preproc', preproc),
                           ('debug', Debug()),
                           ('kmeans_clustering', KMeans())])
Upvotes: 1