Reputation: 89
I'm trying to figure out how to unscale my data (presumably using inverse_transform) for predictions when I'm using a pipeline. The data below is just an example. My actual data is much larger and more complicated, but I'm looking to use RobustScaler (as my data has outliers) and Lasso (as my data has dozens of useless features). I am new to pipelines in general.
Basically, if I try to use this model to predict anything, I want that prediction in unscaled terms. Is this possible with a pipeline? How can I do this with inverse_transform?
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
data = [[100, 1, 50],[500 , 3, 25],[1000 , 10, 100]]
df = pd.DataFrame(data,columns=['Cost','People', 'Supplies'])
X = df[['People', 'Supplies']]
y = df[['Cost']]
#Split
X_train,X_test,y_train,y_test = train_test_split(X,y)
#Pipeline
pipeline = Pipeline([('scale', RobustScaler()),
                     ('alg', Lasso())])
clf = pipeline.fit(X_train,y_train)
train_score = clf.score(X_train,y_train)
test_score = clf.score(X_test,y_test)
print ("training score:", train_score)
print ("test score:", test_score)
#Predict example
example = [[10,100]]
clf.predict(example)
Upvotes: 6
Views: 7645
Reputation: 2330
Simple Explanation
Your pipeline is only transforming the values in X, not y. The differences you are seeing in y for predictions are related to the differences in the coefficient values between two models fitted using scaled vs. unscaled data.
So, if you "want that prediction in unscaled terms" then take the scaler out of your pipeline. If you want that prediction in scaled terms, you need to scale the new prediction data prior to passing it to the .predict() function. The Pipeline actually does this for you automatically if you have included a scaler step in it.
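For completeness, if you *did* want y scaled during fitting and the predictions automatically converted back to original units via inverse_transform, scikit-learn provides TransformedTargetRegressor for exactly that. A minimal sketch using the example data from the question (the wrapping of the pipeline is an illustration, not something your code requires):

```python
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

data = [[100, 1, 50], [500, 3, 25], [1000, 10, 100]]
df = pd.DataFrame(data, columns=['Cost', 'People', 'Supplies'])
X = df[['People', 'Supplies']]
y = df['Cost']

# The transformer scales y before fitting and applies
# inverse_transform to the regressor's output on predict.
model = TransformedTargetRegressor(
    regressor=Pipeline([('scale', RobustScaler()),
                        ('alg', Lasso())]),
    transformer=RobustScaler())
model.fit(X, y)

pred = model.predict([[10, 100]])  # already in original Cost units
```

This way the "unscaling" of predictions happens inside the estimator, so you never call inverse_transform yourself.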
Scaling and Regression
The practical purpose of scaling here would be when people and supplies have different dynamic ranges. Using the RobustScaler() removes the median and scales the data according to the quantile range. Typically you would only do this if you thought that your people or supply data has outliers that would influence the sample mean / variance in a negative way. If this is not the case, you would likely use the StandardScaler() to remove the mean and scale to unit variance.
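To see the difference concretely, here is a small illustration on made-up 1-D data with one extreme value. RobustScaler centers on the median and divides by the IQR, so the outlier barely affects how the other points are scaled, whereas StandardScaler's mean and standard deviation are dragged by it:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.], [2.], [3.], [4.], [1000.]])  # 1000 is an outlier

# RobustScaler: (x - median) / IQR -> median (3) maps to 0, IQR = 2
robust = RobustScaler().fit_transform(X)
print(robust.ravel())    # -> [ -1.   -0.5   0.    0.5 498.5]

# StandardScaler: (x - mean) / std -> the inliers are squashed together
# because the outlier inflates both the mean and the std
standard = StandardScaler().fit_transform(X)
print(standard.ravel())
```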
Once the data is scaled, you can compare the regression coefficients to better understand how the model is making its predictions. This is important since the coefficients for unscaled data may be very misleading.
An Example Using Your Code
The following example shows:
Predictions using both scaled and unscaled data with and without the pipeline.
The predictions match in both cases.
You can see what the pipeline is doing in the background by looking at the non-pipeline examples.
I have also included the model coefficients in both cases. Note that the coefficients or weights for the scaled vs. unscaled fitted models are very different.
These coefficients are used to generate each prediction value for the variable example.
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

data = [[100, 1, 50],[500 , 3, 25],[1000 , 10, 100]]
df = pd.DataFrame(data,columns=['Cost','People', 'Supplies'])
X = df[['People', 'Supplies']]
y = df[['Cost']]

#Split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)

#Pipeline
pipeline_scaled = Pipeline([('scale', RobustScaler()),
                            ('alg', Lasso(random_state=0))])
pipeline_unscaled = Pipeline([('alg', Lasso(random_state=0))])

clf1 = pipeline_scaled.fit(X_train,y_train)
clf2 = pipeline_unscaled.fit(X_train,y_train)

#Pipeline predict example
example = [[10,100]]
print('Pipe Scaled: ', clf1.predict(example))
print('Pipe Unscaled: ',clf2.predict(example))

#------------------------------------------------

rs = RobustScaler()
reg = Lasso(random_state=0)

# Scale the training data
X_train_scaled = rs.fit_transform(X_train)
reg.fit(X_train_scaled, y_train)

# Scale the example
example_scaled = rs.transform(example)

# Predict using the scaled data
print('----------------------')
print('Reg Scaled: ', reg.predict(example_scaled))
print('Scaled Coefficients:',reg.coef_)

#------------------------------------------------

reg.fit(X_train, y_train)
print('Reg Unscaled: ', reg.predict(example))
print('Unscaled Coefficients:',reg.coef_)
Outputs:
Pipe Scaled:  [1892.]
Pipe Unscaled:  [-699.6]
----------------------
Reg Scaled:  [1892.]
Scaled Coefficients: [199.  -0.]
Reg Unscaled:  [-699.6]
Unscaled Coefficients: [  0.     -15.9936]
For Completeness
Your original question asks about "unscaling" your data. I don't think this is what you actually need, since X_train is your unscaled data. However, the following example shows how you could do this as well using the scaler object from your pipeline.
#------------------------------------------------
pipeline_scaled['scale'].inverse_transform(X_train_scaled)
Output
array([[ 3., 25.],
       [ 1., 50.]])
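As a quick sanity check, inverse_transform is the exact inverse of transform, so a round trip through the fitted scaler recovers the original rows. A minimal sketch using the two training rows shown in the output above:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X_train = np.array([[3., 25.], [1., 50.]])  # the two training rows above

rs = RobustScaler()
X_scaled = rs.fit_transform(X_train)
X_back = rs.inverse_transform(X_scaled)

# The round trip reproduces the original (unscaled) data
assert np.allclose(X_back, X_train)
```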
Upvotes: 4