Jaime Vera

Reputation: 91

How to implement inverse transformation in a pipeline of a ColumnTransformer?

I would like to understand how to apply the inverse transformation when the scaler sits inside a ColumnTransformer, rather than calling StandardScaler directly.

The code that I am using is the following:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# X is an existing DataFrame with categorical (object) and integer columns
categoric = X.select_dtypes(['object']).columns
numeric = X.select_dtypes(['int']).columns

tf = ColumnTransformer([('onehot', OneHotEncoder(), categoric),
                        ('scaler', StandardScaler(), numeric)])

X_preprocessed = tf.fit_transform(X)

model = KMeans(n_clusters=2, random_state=24)
model.fit(X_preprocessed)

After getting the output of a given model (KMeans in this case), how can I get back the original scale of the numeric values of any X dataframe?

I know StandardScaler has a method (.inverse_transform) to do that, but my question is about how to use it when the scaler is part of a ColumnTransformer.

P.S.: The objective of doing so is to interpret the centroids of the model.

Upvotes: 8

Views: 5721

Answers (1)

Ruben Debien

Reputation: 61

You might have already found a solution, but I had a similar issue. I am working with pandas and wanted the ColumnTransformer to return a DataFrame again. I do this by reassigning the column names in the order in which the ColumnTransformer emits them. To make sure I hadn't mislabeled any columns, I wanted to invert the transformation and check that it returned the original DataFrame.

There are two ways to access the fitted sub-transformers inside your tf:

tf.transformers_[1][1]  # second entry is a (name, fitted transformer, columns) tuple; index 1 is the fitted transformer
tf.named_transformers_['scaler']

You can then call inverse_transform on that particular sub-transformer. This only inverts the output columns that the sub-transformer produced, so to recover the full dataset you have to inverse-transform each block of columns with its own transformer and then combine the results into one frame again.
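For the centroid use case in the question, a minimal sketch could look like the following. The toy DataFrame, its column names, and the explicit `categoric`/`numeric` lists are made up for illustration. It also assumes that the output columns follow the order of the transformer list (one-hot columns first, scaled columns after) and sets `sparse_threshold=0` so the result is a dense array that can be sliced.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the question's X.
X = pd.DataFrame({
    "color": ["red", "blue", "red", "blue", "red", "blue"],
    "size": [1, 2, 1, 9, 10, 9],
    "weight": [10, 12, 11, 50, 52, 51],
})
categoric = ["color"]
numeric = ["size", "weight"]

tf = ColumnTransformer(
    [("onehot", OneHotEncoder(), categoric),
     ("scaler", StandardScaler(), numeric)],
    sparse_threshold=0,  # always return a dense array so it can be sliced
)
X_preprocessed = tf.fit_transform(X)

model = KMeans(n_clusters=2, random_state=24, n_init=10)
model.fit(X_preprocessed)

# Fitted sub-transformers, looked up by the names given above.
onehot = tf.named_transformers_["onehot"]
scaler = tf.named_transformers_["scaler"]

# Number of one-hot output columns; the scaled columns come after them.
n_onehot = sum(len(cats) for cats in onehot.categories_)

# Centroids of the numeric features, back on the original scale.
centroids = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_[:, n_onehot:]),
    columns=numeric,
)
print(centroids)

# Round-trip check on the data itself, one block of columns at a time.
restored_cat = pd.DataFrame(
    onehot.inverse_transform(X_preprocessed[:, :n_onehot]),
    columns=categoric,
)
restored_num = pd.DataFrame(
    scaler.inverse_transform(X_preprocessed[:, n_onehot:]),
    columns=numeric,
)
restored = pd.concat([restored_cat, restored_num], axis=1)
```

Only the numeric slice of each centroid is passed through the scaler. The one-hot slice is left as is, since a centroid's one-hot values are averages rather than valid 0/1 encodings.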

Upvotes: 2
