Centering variables in R prevents prediction?

Question

I had a hair-ripping-out event recently, in which after much pain, I found out that using the scale() function on variables prevented me from using the predict function. I was pretty flabbergasted that something as simple as centering a variable would fundamentally change its type. I'm not good at explaining this, so it's probably easier to see what I mean just by running the code below.

df = data.frame(
  a=runif(100,45,90),
  b=runif(100,0,60),
  y=runif(100,-30,60)
)

df$a.center=scale(df$a,scale=FALSE)
df$b.center=scale(df$b,scale=FALSE)

m<-lm(y ~ a.center + b.center, data=df)

predict_df = data.frame(
  a.center=c(-10,10),
  b.center=c(-5,5)
)
predict_df$predicted = predict(m,predict_df)

I get the error:

Error: variables ‘a.center’, ‘b.center’ were specified with different types from the fit

Compare that to this code, which doesn't use centered variables and works as it's supposed to:

m2<-lm(y ~ a + b, data=df)
predict_df2 = data.frame(
  a=c(-10,10),
  b=c(-5,5)
)
predict_df2$predicted = predict(m2,predict_df2)

I also noticed when doing str(df) that the centered variables have something called "attr" below them:

'data.frame':   100 obs. of  5 variables:
$ a       : num  71.4 57.1 83.9 49 65 ...
$ b       : num  54.56 16.76 52.43 34.11 2.43 ...
$ y       : num  -14.1 -20.8 31.3 -23 51.1 ...
$ a.center: num [1:100, 1] 2.51 -11.77 14.96 -19.89 -3.87 ...
..- attr(*, "scaled:center")= num 68.9
$ b.center: num [1:100, 1] 23.31 -14.49 21.18 2.86 -28.82 ...
..- attr(*, "scaled:center")= num 31.3

So my question is: What the heck is happening here? Should I just refrain from using the scale function? Is there a simple fix to this, and what is the "attr" thing I see in str(df)?

sconfluentus · Accepted Answer

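I would continue to use scale. The catch is that scale() returns a one-column matrix rather than a plain numeric vector (the help page, ?scale, mentions this), so your data frame ends up with two matrix columns. lm() records those predictors as matrices, and predict() then complains that the plain numeric columns in predict_df are a different type. This is the structure you get after centering: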

'data.frame':   100 obs. of  5 variables:
$ a       : num  86.1 76.1 75.3 55.3 53.1 ...
$ b       : num  48.99 5.99 11.34 56.47 12.9 ...
$ y       : num  -20.65 8.21 -21.6 13.36 -27.32 ...
$ a.center: num [1:100, 1] 17.85 7.87 7.11 -12.93 -15.16 ...
 ..- attr(*, "scaled:center")= num 68.2
$ b.center: num [1:100, 1] 19.6 -23.4 -18 27.1 -16.5 ...
 ..- attr(*, "scaled:center")= num 29.4

Using as.vector to convert is the way to go. Just convert them back after scaling.

This is the only new step in the process:

df$a.center<-as.vector(df$a.center)
df$b.center<-as.vector(df$b.center)

Then your resulting data is once again in the structure you had hoped for:

 str(df)
'data.frame':   100 obs. of  5 variables:
 $ a       : num  86.1 76.1 75.3 55.3 53.1 ...
 $ b       : num  48.99 5.99 11.34 56.47 12.9 ...
 $ y       : num  -20.65 8.21 -21.6 13.36 -27.32 ...
 $ a.center: num  17.85 7.87 7.11 -12.93 -15.16 ...
 $ b.center: num  19.6 -23.4 -18 27.1 -16.5 ...

Then run your linear model and predictions as usual, using your code from above unchanged. With my random data the results were as follows (yours will differ, since no seed is set):

 predict_df
 a.center b.center predicted
 1      -10       -5  9.534243
 2       10        5 16.399051

I would definitely continue to use scale, provided you are comfortable choosing among the three options for each of its center and scale arguments (TRUE, FALSE, or a numeric vector), as listed on the help page, and know which one your particular model needs.

The reason I suggest this is precisely because of the attr.

attr is an attribute of the matrix that is returned by running scale on a vector or data frame. It is a way of saving information about the transformation without including it in the actual data. It is sort of like metadata about the transformed data.
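As a minimal sketch of how to look at that metadata directly (the variable names x, x_centered and x_plain are made up for illustration, and the seed is arbitrary), you can read the attribute with attr() and see that as.vector() drops it:

```r
set.seed(1)                               # arbitrary seed, only for reproducibility
x <- runif(100, 45, 90)                   # same range as column a in the question

x_centered <- scale(x, scale = FALSE)     # center only, no scaling
is.matrix(x_centered)                     # scale() hands back a one-column matrix

# The centering value is stored on the matrix as an attribute
center <- attr(x_centered, "scaled:center")
all.equal(center, mean(x))                # it is the mean of the original values

# as.vector() strips the dimensions and the attribute, leaving plain numbers
x_plain <- as.vector(x_centered)
is.null(attributes(x_plain))
```

So if you want the centering value later, grab it with attr() before you convert.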

In this case, the attribute is the mean of the column, after NA values are removed, and it is the value used to center your data. You can verify this by doing a mean calculation as follows:

mean(df$a)
[1] 68.23281

mean(df$b)
[1] 29.38355

If you had also chosen to scale, there would have been a second attribute for each, "scaled:scale", holding the standard deviation of the column after NA values are removed.
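A quick sketch of that case, using made-up names (x, x_std) and an arbitrary seed: calling scale() with its defaults centers and scales, and both values are stored as attributes.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
x <- runif(100, 0, 60)               # same range as column b in the question

x_std <- scale(x)                    # defaults: center = TRUE, scale = TRUE

stored_center <- attr(x_std, "scaled:center")   # value subtracted from x
stored_scale  <- attr(x_std, "scaled:scale")    # value x was divided by

all.equal(stored_center, mean(x))    # matches the column mean
all.equal(stored_scale, sd(x))       # matches the column standard deviation
```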

R has kindly made note of the centering and scaling values for you.

Depending on how you use your prediction and the scrutiny your work goes through, it is useful to have these values. Also, the mean and standard deviation are a great quick check to see if you are properly preparing your data prior to modeling.
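One place the stored values come in handy, sketched here under my own assumptions (the helper objects a_mean, b_mean, new_raw and new_centered are hypothetical names, and the seed is arbitrary): if someone hands you new observations on the original scale, you can center them with the saved means before calling predict.

```r
set.seed(42)                                   # arbitrary seed for reproducibility
df <- data.frame(
  a = runif(100, 45, 90),
  b = runif(100, 0, 60),
  y = runif(100, -30, 60)
)

# Center, remember the centering values, then store plain vectors in df
a_scaled <- scale(df$a, scale = FALSE)
b_scaled <- scale(df$b, scale = FALSE)
a_mean <- attr(a_scaled, "scaled:center")
b_mean <- attr(b_scaled, "scaled:center")
df$a.center <- as.vector(a_scaled)
df$b.center <- as.vector(b_scaled)

m <- lm(y ~ a.center + b.center, data = df)

# New data arrives on the original (uncentered) scale
new_raw <- data.frame(a = c(60, 80), b = c(20, 40))

# Apply the same centering that was used when fitting
new_centered <- data.frame(
  a.center = new_raw$a - a_mean,
  b.center = new_raw$b - b_mean
)
new_raw$predicted <- predict(m, new_centered)
```

Centering only shifts the predictors, so these predictions should agree with those from a model fitted on the raw a and b columns.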

Definitely worth the hassle of converting to a vector or data frame!

If you try this yourself, call set.seed() before generating the random data so you can reproduce the same values each time you rerun the conversion.

And consider copying the data frame under a new name before using as.vector, so you keep the original with the attributes for future use and run the linear model on the converted copy.
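Putting those two suggestions together, here is one way it could look. This is only a sketch: df_model is a name I made up for the converted copy, the seed is arbitrary, and the lapply() call is just one convenient way to run as.vector over every column.

```r
set.seed(123)                                  # arbitrary seed so reruns match
df <- data.frame(
  a = runif(100, 45, 90),
  b = runif(100, 0, 60),
  y = runif(100, -30, 60)
)
df$a.center <- scale(df$a, scale = FALSE)      # matrix columns, attributes kept
df$b.center <- scale(df$b, scale = FALSE)

# Work on a copy; df keeps the matrices and their "scaled:center" attributes
df_model <- df
df_model[] <- lapply(df_model, as.vector)      # every column becomes a plain vector

m <- lm(y ~ a.center + b.center, data = df_model)

predict_df <- data.frame(
  a.center = c(-10, 10),
  b.center = c(-5, 5)
)
predict_df$predicted <- predict(m, predict_df) # no type-mismatch error now

attr(df$a.center, "scaled:center")             # still available on the original
```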
