Mohit Verma

Reputation: 2089

Scaling independent variables while predicting using linear regression model

I am trying to get a linear model where Y is dependent variable and X1, X2, X3 are my independent variables.

I have scaled my input using the 'scale' method in R and obtained the coefficients and intercept.

Y = a1X1 + a2X2 + a3X3 + c

Now, to predict Y for a given value of (X1, X2, X3), is it OK to directly compute Y using the above equation, or should the input variables be scaled before plugging them in? If yes, how can we scale them?

Upvotes: 1

Views: 10662

Answers (2)

jlhoward

Reputation: 59415

If you have a training set (the original data) and a test set (the new data), and you build a model using the training set scaled to [0,1], then when you make predictions with this model using the test set, you have to scale that first as well. But be careful: you have to scale the test set using the same parameters as the training set. So if you scale using (x - min(x)) / (max(x) - min(x)), you must use the values of max(x) and min(x) from the training dataset. Here's an example:

set.seed(1)      # for reproducible example
train <- data.frame(X1=sample(1:100,100),
                    X2=1e6*sample(1:100,100),
                    X3=1e-6*sample(1:100,100))
train$y <- with(train,2*X1 + 3*1e-6*X2 - 5*1e6*X3 + 1 + rnorm(100,sd=10))

fit  <- lm(y~X1+X2+X3,train)
summary(fit)
# ...
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.063e+00  3.221e+00    0.33    0.742    
# X1           2.017e+00  3.698e-02   54.55   <2e-16 ***
# X2           2.974e-06  3.694e-08   80.51   <2e-16 ***
# X3          -4.988e+06  3.715e+04 -134.28   <2e-16 ***
# ---

# scale the predictor variables to [0,1]
mins   <- sapply(train[,1:3],min)
ranges <- sapply(train[,1:3],function(x)diff(range(x)))
train.scaled <- as.data.frame(scale(train[,1:3],center=mins,scale=ranges))
train.scaled$y <- train$y
fit.scaled <- lm(y ~ X1 + X2 + X3, train.scaled)
summary(fit.scaled)
# ...
# Coefficients:
#             Estimate Std. Error  t value Pr(>|t|)    
# (Intercept)    1.066      3.164    0.337    0.737    
# X1           199.731      3.661   54.553   <2e-16 ***
# X2           294.421      3.657   80.508   <2e-16 ***
# X3          -493.828      3.678 -134.275   <2e-16 ***
# ---

Note that, as expected, scaling affects the values of the coefficients (of course...), but not the t-values, the standard errors of the fit, R-squared, or the F-statistic (I've only reproduced part of the summaries here).

Now let's compare the effect of scaling with a test dataset.

# create test dataset
test <- data.frame(X1=sample(-5:5,10),
                   X2=1e6*sample(-5:5,10),
                   X3=1e-6*sample(-5:5,10))
# predict y based on test data with un-scaled fit
pred   <- predict(fit,newdata=test)

# scale the test data using min and range from training dataset
test.scaled <- as.data.frame(scale(test[,1:3],center=mins,scale=ranges))
# predict y based on new data scaled, with fit from scaled dataset
pred.scaled   <- predict(fit.scaled,newdata=test.scaled)

all.equal(pred,pred.scaled)
# [1] TRUE

So prediction using the un-scaled fit with un-scaled data yields exactly the same result as prediction using the scaled fit with scaled data.
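To see why the "be careful" above matters, here is a sketch of the pitfall, continuing the example: if we (incorrectly) scale the test set with its own min and range instead of the training set's, the predictions no longer agree.

```r
# WRONG: scale the test data using its own min and range
bad.mins   <- sapply(test[,1:3], min)
bad.ranges <- sapply(test[,1:3], function(x) diff(range(x)))
test.bad   <- as.data.frame(scale(test[,1:3], center=bad.mins, scale=bad.ranges))

pred.bad <- predict(fit.scaled, newdata=test.bad)
all.equal(pred, pred.bad)
# no longer TRUE -- the predictions differ from those of the un-scaled fit
```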

Upvotes: 5

Gregor Thomas

Reputation: 146050

"is it ok to directly compute value of Y using above equation or should the input variables be scaled before putting them in equation"

The input variables should be scaled in the same way as you did your initial scaling.

"If yes, how can we scale them ?"

Read the documentation for the command you used (?scale) and see what it did! Then replicate it for your new prediction data. If you used the defaults, it subtracted the means of your original predictors, then divided by the standard deviations. You should go back to the raw data, calculate the means and standard deviations, and use those to scale your data for prediction in the same way.
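For instance, scale() with its defaults stores the centering and scaling values it used as attributes on its result, so you can reuse them on new data. A minimal sketch (the data here is made up for illustration):

```r
# some original training predictors (illustrative)
X <- data.frame(X1 = rnorm(100, 50, 10), X2 = rnorm(100, 0, 5))

X.scaled <- scale(X)                    # default: center on means, divide by sds
mu  <- attr(X.scaled, "scaled:center")  # the column means used
sds <- attr(X.scaled, "scaled:scale")   # the column standard deviations used

# new data for prediction must be scaled with the SAME means and sds
newdata <- data.frame(X1 = c(40, 60), X2 = c(-1, 2))
newdata.scaled <- scale(newdata, center = mu, scale = sds)
```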

Transforming fitted coefficients

Your other option is to transform the coefficients. This just takes a little bit of algebra. If your scaling transformation is f(x) = mx + b, and your fitted model is y = a * f(x) + c, it's easy to see that

y = a * f(x) + c
y = a * (mx + b) + c
y = a m x + a b + c

So, with untransformed data x, your slope is a * m and your intercept is a * b + c. This extends easily to more variables or to a different transformation. If you're scaling to [0, 1], your transformation is probably f(x) = (x - min(x)) / (max(x) - min(x)), i.e. m = 1 / (max(x) - min(x)) and b = -min(x) / (max(x) - min(x))... the algebra shouldn't be difficult, but I'll leave it to you.
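That algebra can be sketched in R for the single-predictor, [0, 1] case (the data and variable names here are made up for illustration):

```r
set.seed(1)
x <- runif(50, 10, 20)
y <- 2 * x + 5 + rnorm(50)

# scale to [0,1]: f(x) = (x - min(x)) / (max(x) - min(x)),
# so m = 1/range and b = -min/range
x.min    <- min(x)
x.range  <- max(x) - min(x)
x.scaled <- (x - x.min) / x.range

fit.scaled <- lm(y ~ x.scaled)
a <- coef(fit.scaled)[2]   # slope on scaled data
c <- coef(fit.scaled)[1]   # intercept on scaled data

# back-transform: slope and intercept on the original x scale
slope.unscaled     <- a / x.range
intercept.unscaled <- c - a * x.min / x.range
# these agree with coef(lm(y ~ x)) up to floating-point rounding
```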

Upvotes: 2
