Caleb Wilkinson

Reputation: 43

Sklearn inverse_transform return only one column when fit to many

Is there a way to inverse_transform one column with sklearn when the initial transformer was fit on the whole data set? Below is an example of what I am trying to achieve.

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Setting up a dummy pipeline
pipes = []
pipes.append(('scaler', MinMaxScaler()))
transformation_pipeline = Pipeline(pipes)

# Dummy data.
df = pd.DataFrame(
    {'data1': [1, 2, 3, 1, 2, 3],
     'data2': [1, 1, 1, 2, 2, 2],
     'Y': [1, 4, 1, 2, 2, 2]
    }
)

# Fitting the transformation pipeline
test = transformation_pipeline.fit_transform(df)

# Pulling the scaler function from the pipeline.
scaler = transformation_pipeline.named_steps['scaler']

# This is what I thought might work (note that fit_transform returns an
# ndarray, so the scaled Y column is test[:, 2]).
predicted_transformed = scaler.inverse_transform(test[:, 2])

# The output would look something like this,
# essentially overlooking that the scaler was fit on 3 variables and inverting
# only the last one, or whichever column I need.
predicted_transformed = [1, 4, 1, 2, 2, 2]

I need to fit the whole dataset as part of a data-prep process, but I then import the scaler later into another instance with sklearn.externals.joblib. In that new instance the predicted values are the only thing that exists, so I need to apply just the inverse scaling for the Y column to get back the originals.
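For context, the persistence step looks roughly like this (a sketch; the file name scaler.joblib is illustrative, and the standalone joblib package can stand in for the deprecated sklearn.externals.joblib):

import joblib

# In the data-prep instance: persist the scaler that was fit on the full frame.
joblib.dump(scaler, 'scaler.joblib')

# In the prediction instance: load it back; only predicted Y values exist here.
scaler = joblib.load('scaler.joblib')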

I am aware that I could fit one transformer for the X variables and another for the Y variable; however, I would like to avoid this, since it would add to the complexity of moving the scalers around and maintaining both of them in future projects.

Upvotes: 3

Views: 8466

Answers (3)

Launch9

Reputation: 24

Improving on what Willem said: this works with less input, since the number of columns can be read from the fitted scaler itself.

def invTransform(scaler, data):
    # Build an all-zero frame with as many columns as the scaler was fit on
    # (n_features_in_ is set when the scaler is fitted).
    dummy = pd.DataFrame(np.zeros((len(data), scaler.n_features_in_)))
    # Note: this assumes the column to invert is the first column the scaler was fit on.
    dummy[0] = data
    dummy = pd.DataFrame(scaler.inverse_transform(dummy), columns=dummy.columns)
    return dummy[0].values

Upvotes: -1

Willem

Reputation: 1124

A bit late but I think this code does what you are looking for:

# - scaler   = the scaler object (it needs an inverse_transform method)
# - data     = the data to be inverse transformed as a Series, ndarray, ...
#              (a 1d object you can assign to a df column)
# - colName  = the name of the column to which the data belongs
# - colNames = all column names of the data on which scaler was fit
#              (necessary because scaler will only accept a df of the same shape as the one it was fit on)
def invTransform(scaler, data, colName, colNames):
    dummy = pd.DataFrame(np.zeros((len(data), len(colNames))), columns=colNames)
    dummy[colName] = data
    dummy = pd.DataFrame(scaler.inverse_transform(dummy), columns=colNames)
    return dummy[colName].values

Note that you need to provide enough information to use the inverse_transform method of the scaler object behind the scenes.
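For example, with the variables from the question's setup this might be called as follows (a sketch; it assumes test is the scaled ndarray and df the original frame, with 'Y' as the third column):

# Inverse-transform only the scaled Y column.
original_Y = invTransform(scaler, test[:, 2], 'Y', df.columns)
# original_Y is approximately array([1., 4., 1., 2., 2., 2.])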

Upvotes: 5

user7400474

Reputation: 41

I had a similar problem: I have a multidimensional time series as input (a quantity and 'exogenous' variables) and one dimension (the quantity) as output. I am unable to invert the scaling to compare the forecast to the original test set, since the scaler expects a multidimensional input.

One solution I can think of is using separate scalers for the quantity and the exogenous columns (sketched below).

Another solution I can think of is to give the scaler sufficient 'junk' columns just to fill out the dimensions of the array to be unscaled, then only look at the first column of the output.

Then, once I forecast, I can invert the scaling on the forecast to get values which I can compare to the test set.
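A minimal sketch of the first suggestion, assuming a MinMaxScaler and the illustrative names x_scaler / y_scaler; fitting a dedicated scaler on the target column means its inverse_transform never needs the exogenous columns:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'data1': [1, 2, 3, 1, 2, 3],
                   'data2': [1, 1, 1, 2, 2, 2],
                   'Y':     [1, 4, 1, 2, 2, 2]})

# Separate scalers: one for the exogenous columns, one for the target.
x_scaler = MinMaxScaler().fit(df[['data1', 'data2']])
y_scaler = MinMaxScaler().fit(df[['Y']])

scaled_y = y_scaler.transform(df[['Y']])

# Later, a forecast in the scaled space is inverted with y_scaler alone.
forecast_scaled = scaled_y  # stand-in for a model's scaled forecast
forecast = y_scaler.inverse_transform(forecast_scaled)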

Upvotes: 0
