Ali H. Kudeir

Reputation: 1054

Sklearn OneHotEncoder inside a Pipeline is converting all data types, not only categorical/object ones

I have been experimenting with Scikit Learn's Pipeline class and the Iris dataset. A short summary of each section of my code is as follows:

df:

      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species
0      1            5.1           3.5            1.4           0.2     Iris-setosa
1      2            4.9           3.0            1.4           0.2     Iris-setosa
2      3            4.7           3.2            1.3           0.2     Iris-setosa
3      4            4.6           3.1            1.5           0.2     Iris-setosa
4      5            5.0           3.6            1.4           0.2     Iris-setosa
..   ...            ...           ...            ...           ...             ...
145  146            6.7           3.0            5.2           2.3  Iris-virginica
146  147            6.3           2.5            5.0           1.9  Iris-virginica
147  148            6.5           3.0            5.2           2.0  Iris-virginica

dtypes:

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

pipeline elements:

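import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Debug(BaseEstimator, TransformerMixin):
    """Pass-through step that prints and stores the data it receives."""

    def transform(self, X):
        print(pd.DataFrame(X).head())
        print(X.shape)
        # Keep the intermediate data so it can be inspected after fitting.
        self.X = X
        self.df = pd.DataFrame(self.X)
        return X

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; defined only to satisfy the transformer API.
        return self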

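from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pipeline = Pipeline(steps=[('one_hot_encoding',  OneHotEncoder(sparse=False)),
                           ('debug_1', Debug()),
                           ('standard_scaler',   StandardScaler(with_mean=False)),
                           ('debug_2', Debug()),
                           ('kmeans_clustering', KMeans())])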

Now, if I fit this pipeline and then view the data captured by a debug step:

pipeline.fit_transform(df.values)
pipeline.named_steps["debug_2"].df

It seems that the one_hot_encoding step has 0-1 encoded every column of the df, not only the Species (object type) column.
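A minimal sketch that reproduces this outside the pipeline, assuming a small frame built from four of the rows printed above (the reduced column set and the variable names are only for illustration). OneHotEncoder treats every input column as categorical, so each distinct value in each column gets its own indicator column:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in: four rows and three columns taken from the frame above.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.7, 6.3],
    "PetalWidthCm": [0.2, 0.2, 2.3, 1.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# The default output is sparse, so convert it to a dense array for inspection.
encoded = OneHotEncoder().fit_transform(df.values).toarray()

# 4 distinct sepal lengths + 3 distinct petal widths + 2 species = 9 columns.
print(encoded.shape)
print(np.unique(encoded))
```

The numeric columns are expanded just like Species, which matches what the debug step shows.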

Is there a way to make OHE inside a pipeline apply only to specified columns, or only to categorical/object ones?

Upvotes: 0

Views: 268

Answers (1)

Ben Reiniger

Reputation: 12602

You're looking for the ColumnTransformer, optionally with the helper make_column_selector to specify which columns go to each transformer. For example,

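import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preproc = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(with_mean=False), make_column_selector(dtype_include=np.number)),
        ('obj', OneHotEncoder(), make_column_selector(dtype_include=object)),
    ],
)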

or, being more explicit about columns,

preproc = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(with_mean=False), ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]),
        ('obj', OneHotEncoder(), ["Species"]),
    ],
)
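To see how the two variants differ, a selector can be called directly on a DataFrame and returns the matching column names. This is a rough sketch on a three-row stand-in frame (the frame and the variable names numeric_cols and object_cols are illustrative, and Species is forced to object dtype to mirror the dtypes in the question):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Illustrative stand-in for the Iris frame, with Species stored as object dtype.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 4.9, 4.7],
    "Species": pd.Series(["Iris-setosa"] * 3, dtype=object),
})

# A selector is a callable: given a DataFrame, it returns a list of column names.
numeric_cols = make_column_selector(dtype_include=np.number)(df)
object_cols = make_column_selector(dtype_include=object)(df)

print(numeric_cols)
print(object_cols)
```

Note that the dtype-based selector also picks up the int64 Id column, whereas the explicit list leaves it out. Columns not assigned to any transformer are dropped by default (remainder='drop').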

Then

pipeline = Pipeline(steps=[('preproc', preproc),
                           ('debug', Debug()),
                           ('kmeans_clustering', KMeans())])
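Because the columns are selected by name or dtype, the pipeline has to be fit on the DataFrame itself rather than on df.values, which is a plain array without column labels. Below is a self-contained sketch under a few assumptions: six rows copied from the frame in the question, the Debug step left out for brevity, and KMeans settings (n_clusters=2, n_init=10, random_state=0) chosen only to suit this tiny sample:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Six rows copied from the frame in the question (Id column omitted).
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 6.7, 6.3, 6.5],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.0, 2.5, 3.0],
    "PetalLengthCm": [1.4, 1.4, 1.3, 5.2, 5.0, 5.2],
    "PetalWidthCm": [0.2, 0.2, 0.2, 2.3, 1.9, 2.0],
    "Species": ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3,
})

numeric = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
preproc = ColumnTransformer(transformers=[
    ("num", StandardScaler(with_mean=False), numeric),  # scale numeric columns only
    ("obj", OneHotEncoder(), ["Species"]),              # encode the object column only
])

pipeline = Pipeline(steps=[
    ("preproc", preproc),
    # Assumed settings for this small sample, not a recommendation.
    ("kmeans_clustering", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# Fit on the DataFrame, not df.values, so the column names are available.
pipeline.fit(df)

# 4 scaled numeric columns + 2 one-hot columns for the two species.
features = pipeline.named_steps["preproc"].transform(df)
print(features.shape)
```

The numeric columns now pass through the scaler untouched by the encoder, and only Species is expanded into indicator columns.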

Upvotes: 1
