Kelsey

Reputation: 89

How to inverse transform regression predictions after pipeline?

I'm trying to figure out how to unscale my data (presumably using inverse_transform) for predictions when I'm using a pipeline. The data below is just an example. My actual data is much larger and complicated, but I'm looking to use RobustScaler (as my data has outliers) and Lasso (as my data has dozens of useless features). I am new to pipelines in general.

Basically, if I try to use this model to predict anything, I want that prediction in unscaled terms. Is this possible with a pipeline? How can I do this with inverse_transform?

import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

data = [[100, 1, 50],[500 , 3, 25],[1000 , 10, 100]]
df = pd.DataFrame(data,columns=['Cost','People', 'Supplies'])

X = df[['People', 'Supplies']]
y = df[['Cost']]

#Split
X_train,X_test,y_train,y_test = train_test_split(X,y)

#Pipeline
pipeline = Pipeline([('scale', RobustScaler()),
            ('alg', Lasso())])

clf = pipeline.fit(X_train,y_train)

train_score = clf.score(X_train,y_train)
test_score = clf.score(X_test,y_test)

print ("training score:", train_score)
print ("test score:", test_score)

#Predict example 
example = [[10,100]]
clf.predict(example)

Upvotes: 6

Views: 7645

Answers (1)

Jake Drew

Reputation: 2330

Simple Explanation

Your pipeline is only transforming the values in X, not y. The differences you are seeing in y for predictions are related to the differences in the coefficient values between two models fitted using scaled vs. unscaled data.

So, if you "want that prediction in unscaled terms," then take the scaler out of your pipeline. If you want that prediction in scaled terms, you need to scale the new prediction data prior to passing it to the .predict() function. The Pipeline actually does this for you automatically if you have included a scaler step in it.
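
If what you actually want is to scale y as well and still get predictions back on the original Cost scale, scikit-learn's TransformedTargetRegressor can wrap the pipeline: it fits the transformer on y and calls inverse_transform on predictions for you. A minimal sketch using the question's toy data (no train/test split, just to show the mechanics):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

X = np.array([[1, 50], [3, 25], [10, 100]])
y = np.array([100, 500, 1000])

# Scale X inside the pipeline, as in the question.
pipe = Pipeline([('scale', RobustScaler()),
                 ('alg', Lasso(random_state=0))])

# Scale y on fit and inverse_transform predictions automatically,
# so .predict() returns values on the original Cost scale.
model = TransformedTargetRegressor(regressor=pipe, transformer=RobustScaler())
model.fit(X, y)

pred = model.predict(np.array([[10, 100]]))
print(pred)  # a Cost value on the original (unscaled) scale
```

This keeps the scaling of both X and y inside one estimator, so cross-validation and grid search see the whole thing as a single model.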

Scaling and Regression

The practical purpose of scaling here is that People and Supplies may have different dynamic ranges. RobustScaler() removes the median and scales the data according to the quantile range. Typically you would only do this if you thought your People or Supplies data contained outliers that would influence the sample mean / variance in a negative way. If that is not the case, you would likely use StandardScaler() to remove the mean and scale to unit variance.
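
To make that concrete, here is a small sketch (with made-up numbers, not the question's data) showing how a single outlier distorts StandardScaler while barely affecting RobustScaler:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One column with four inliers and one large outlier (assumed values).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(x)      # centers on the median, scales by IQR
standard = StandardScaler().fit_transform(x)  # centers on the mean, scales by std

# The outlier drags the mean up to 22 and inflates the std, so StandardScaler
# squashes the four inliers into a narrow band; RobustScaler's median (3)
# and IQR (2) ignore the outlier and keep the inliers well separated.
print(robust[:4].ravel())
print(standard[:4].ravel())
```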

Once the data is scaled, you can compare the regression coefficients to better understand how the model is making its predictions. This is important since the coefficients for unscaled data can be very misleading.

An Example Using Your Code

The following example shows:

  1. Predictions using both scaled and unscaled data with and without the pipeline.

  2. The predictions match in both cases.

  3. You can see what the pipeline is doing in the background by looking at the non-pipeline examples.

  4. I have also included the model coefficients in both cases. Note that the coefficients or weights for the scaled vs. unscaled fitted models are very different.

  5. These coefficients are used to generate each prediction value for the variable example.

    import pandas as pd
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import RobustScaler
    
    data = [[100, 1, 50],[500 , 3, 25],[1000 , 10, 100]]
    df = pd.DataFrame(data,columns=['Cost','People', 'Supplies'])
    
    X = df[['People', 'Supplies']]
    y = df[['Cost']]
    
    #Split
    X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
    
    #Pipeline
    pipeline_scaled = Pipeline([('scale', RobustScaler()),
                         ('alg', Lasso(random_state=0))])
    
    pipeline_unscaled = Pipeline([('alg', Lasso(random_state=0))])
    
    clf1 = pipeline_scaled.fit(X_train,y_train)
    clf2 = pipeline_unscaled.fit(X_train,y_train)
    
    
    #Pipeline predict example 
    example = [[10,100]]
    print('Pipe Scaled: ', clf1.predict(example))
    print('Pipe Unscaled: ',clf2.predict(example))
    
    #------------------------------------------------
    
    rs = RobustScaler()
    reg = Lasso(random_state=0)
    # Scale the training data 
    X_train_scaled = rs.fit_transform(X_train)
    reg.fit(X_train_scaled, y_train)
    # Scale the example
    example_scaled = rs.transform(example)
    # Predict using the scaled data
    print('----------------------')
    print('Reg Scaled: ', reg.predict(example_scaled))
    print('Scaled Coefficients:',reg.coef_)
    
    #------------------------------------------------
    reg.fit(X_train, y_train)
    print('Reg Unscaled: ', reg.predict(example))
    print('Unscaled Coefficients:',reg.coef_)
    

Outputs:

Pipe Scaled:  [1892.]
Pipe Unscaled:  [-699.6]
----------------------
Reg Scaled:  [1892.]
Scaled Coefficients: [199.  -0.]
Reg Unscaled:  [-699.6]
Unscaled Coefficients: [  0.     -15.9936]

For Completeness

Your original question asks about "unscaling" your data. I don't think this is what you actually need, since X_train is your unscaled data. However, the following example shows how you could do this as well, using the scaler object from your pipeline.

    #------------------------------------------------
    pipeline_scaled['scale'].inverse_transform(X_train_scaled)

Output

array([[ 3., 25.],
       [ 1., 50.]])
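
And if you ever do scale y yourself, keep a separate scaler fitted on y and use its inverse_transform to map predictions back. A hedged sketch using the question's toy data (no split, just to show the round trip):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import RobustScaler

X_train = np.array([[1, 50], [3, 25], [10, 100]])
y_train = np.array([[100], [500], [1000]])   # kept 2-D for the scaler

x_scaler = RobustScaler()
y_scaler = RobustScaler()
X_s = x_scaler.fit_transform(X_train)
y_s = y_scaler.fit_transform(y_train)

reg = Lasso(random_state=0)
reg.fit(X_s, y_s.ravel())

# Predictions come out in scaled-y units; map them back with the y scaler.
pred_scaled = reg.predict(x_scaler.transform(np.array([[10, 100]])))
pred = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1))
print(pred)  # back on the original Cost scale
```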

Upvotes: 4
