Reputation: 669
I had a hair-ripping-out event recently: after much pain, I found out that using the scale() function on variables prevented me from using the predict function. I was pretty flabbergasted that something as simple as centering a variable would fundamentally change its type. I'm not good at explaining this, so it's probably easier to see what I mean by running the code below.
df = data.frame(
a=runif(100,45,90),
b=runif(100,0,60),
y=runif(100,-30,60)
)
df$a.center=scale(df$a,scale=FALSE)
df$b.center=scale(df$b,scale=FALSE)
m<-lm(y ~ a.center + b.center, data=df)
predict_df = data.frame(
a.center=c(-10,10),
b.center=c(-5,5)
)
predict_df$predicted = predict(m,predict_df)
I get the error:
Error: variables ‘a.center’, ‘b.center’ were specified with different types from the fit
Compare that to this code, which doesn't use centered variables and works as it's supposed to:
m2<-lm(y ~ a + b, data=df)
predict_df2 = data.frame(
a=c(-10,10),
b=c(-5,5)
)
predict_df2$predicted = predict(m2,predict_df2)
I also noticed that when running str(df), the centered variables have something called "attr" below them:
'data.frame': 100 obs. of 5 variables:
$ a : num 71.4 57.1 83.9 49 65 ...
$ b : num 54.56 16.76 52.43 34.11 2.43 ...
$ y : num -14.1 -20.8 31.3 -23 51.1 ...
$ a.center: num [1:100, 1] 2.51 -11.77 14.96 -19.89 -3.87 ...
..- attr(*, "scaled:center")= num 68.9
$ b.center: num [1:100, 1] 23.31 -14.49 21.18 2.86 -28.82 ...
..- attr(*, "scaled:center")= num 31.3
So my question is: What the heck is happening here? Should I just refrain from using the scale function? Is there a simple fix to this, and what is the "attr" thing I see in str(df)?
Upvotes: 3
Views: 1327
Reputation: 4993
I would continue to use scale, which gives you the following data frame. Note that it includes two one-column matrices generated by centering; the help page (?scale) mentions that the function returns a matrix.
'data.frame': 100 obs. of 5 variables:
$ a : num 86.1 76.1 75.3 55.3 53.1 ...
$ b : num 48.99 5.99 11.34 56.47 12.9 ...
$ y : num -20.65 8.21 -21.6 13.36 -27.32 ...
$ a.center: num [1:100, 1] 17.85 7.87 7.11 -12.93 -15.16 ...
..- attr(*, "scaled:center")= num 68.2
$ b.center: num [1:100, 1] 19.6 -23.4 -18 27.1 -16.5 ...
..- attr(*, "scaled:center")= num 29.4
Using as.vector to convert is the way to go. Just convert the columns back after scaling.
# only new step in the process
df$a.center<-as.vector(df$a.center)
df$b.center<-as.vector(df$b.center)
Then your resulting data is once again in the structure you had hoped for:
str(df)
'data.frame': 100 obs. of 5 variables:
$ a : num 86.1 76.1 75.3 55.3 53.1 ...
$ b : num 48.99 5.99 11.34 56.47 12.9 ...
$ y : num -20.65 8.21 -21.6 13.36 -27.32 ...
$ a.center: num 17.85 7.87 7.11 -12.93 -15.16 ...
$ b.center: num 19.6 -23.4 -18 27.1 -16.5 ...
Then run your linear model and predictions as usual, using the code from your question, with results like the following (the exact numbers depend on the random data):
predict_df
a.center b.center predicted
1 -10 -5 9.534243
2 10 5 16.399051
I would definitely continue to use scale if you are comfortable choosing among the three options for each of its center and scale arguments (TRUE, FALSE, or a numeric vector) listed on the help page, and know how to select what you need for your particular model.
The reason I suggest this is precisely because of the attr.
The attr line in the str output marks an attribute of the matrix that is returned by running scale on a vector or data frame. It is a way of saving information about the transformation without including it in the actual data. It is sort of like metadata about the transformed data.
In this case, the attribute is the mean of the column, after NA values are removed, and it is the value used to center your data. You can verify this by doing a mean calculation as follows:
mean(df$a)
[1] 68.23281
mean(df$b)
[1] 29.38355
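The stored value can also be read back directly with attr(). A minimal sketch, assuming a freshly generated vector x (a hypothetical stand-in for df$a, with an arbitrary seed):

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
x <- runif(100, 45, 90)              # hypothetical stand-in for df$a
x.center <- scale(x, scale = FALSE)  # center only; returns a one-column matrix

# The centering value is stored in the "scaled:center" attribute
stored <- attr(x.center, "scaled:center")

# It should match the plain column mean
all.equal(stored, mean(x))
```

The attribute name "scaled:center" is the same one that shows up in the str output.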
If you had also chosen to scale, there would have been a second attribute for each ("scaled:scale"), holding the standard deviation of the column after NA values are removed.
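As a rough illustration of what scaling adds, here is a small sketch with the default arguments (centering and scaling both on). The vector x and the seed are made up for the example:

```r
set.seed(1)                   # arbitrary seed, only for reproducibility
x <- runif(100, 45, 90)       # hypothetical stand-in for df$a
x.std <- scale(x)             # center = TRUE and scale = TRUE by default

ctr <- attr(x.std, "scaled:center")  # value subtracted from the column
scl <- attr(x.std, "scaled:scale")   # value the centered column was divided by

all.equal(ctr, mean(x))       # center matches the mean
all.equal(scl, sd(x))         # scale matches the standard deviation
```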
R has kindly made note of the centering and scaling values for you.
Depending on how you use your prediction and the scrutiny your work goes through, it is useful to have these values. Also, the mean and standard deviation are a great quick check to see if you are properly preparing your data prior to modeling.
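One place the saved center comes in handy is when new observations arrive on the raw scale. The sketch below is only one way to do it, with made-up data and hypothetical names (a.mean, new_raw): it pulls the center out of the attribute, applies the same shift to the new values, and then predicts.

```r
set.seed(42)                                  # arbitrary seed
df <- data.frame(a = runif(100, 45, 90),
                 y = runif(100, -30, 60))

a.scaled <- scale(df$a, scale = FALSE)        # one-column matrix with attribute
a.mean   <- attr(a.scaled, "scaled:center")   # save the centering value
df$a.center <- as.vector(a.scaled)            # plain numeric column for lm

m <- lm(y ~ a.center, data = df)

# New data on the raw scale: shift it by the SAME center used for the fit
new_raw <- data.frame(a = c(50, 80))
new_raw$a.center  <- new_raw$a - a.mean
new_raw$predicted <- predict(m, new_raw)
new_raw
```

Centering the new values by their own mean instead would shift them differently from the training data, which is why keeping the original value around matters.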
Definitely worth the hassle of converting to a vector or data frame!
If you try this yourself, make sure you set a seed (set.seed) so the random data, and therefore the conversions, can be reproduced exactly.
Also consider copying the data frame under a new name before using as.vector, so you can keep the original with the attributes in there for future use and run the linear model on the converted copy.
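A possible way to set that up, sketched with the same kind of random data as in the question. The name df_model and the seed are just placeholders:

```r
set.seed(123)                               # arbitrary seed
df <- data.frame(a = runif(100, 45, 90),
                 b = runif(100, 0, 60),
                 y = runif(100, -30, 60))
df$a.center <- scale(df$a, scale = FALSE)   # matrix columns, attributes kept
df$b.center <- scale(df$b, scale = FALSE)

# Converted copy for modeling (hypothetical name); df stays untouched
df_model <- df
df_model$a.center <- as.vector(df$a.center)
df_model$b.center <- as.vector(df$b.center)

# The original still carries the centering values
attr(df$a.center, "scaled:center")
attr(df$b.center, "scaled:center")

# Fit on the converted copy
m <- lm(y ~ a.center + b.center, data = df_model)
```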
Upvotes: 2
Reputation: 42090
Look at the class of each column of the data frame, and you'll see the problem:
> sapply(df, class)
a b y a.center b.center
"numeric" "numeric" "numeric" "matrix" "matrix"
It appears that scale returns a matrix, and apparently the data frame is happy to accept a single-column matrix as one of its columns, but lm and predict do not consider a one-column matrix to be equivalent to a vector. So this is a kind of weird and unfortunate interaction between 3 edge cases. To fix it, either avoid using scale:
df$a.center <- df$a - mean(df$a)
df$b.center <- df$b - mean(df$b)
or else explicitly convert the result back to a vector:
df$a.center <- as.vector(scale(df$a,scale=FALSE))
df$b.center <- as.vector(scale(df$b,scale=FALSE))
Alternatively, you can assign the resulting matrix from scale back into columns of the data frame using 2-D matrix-indexing notation, which does the right thing:
df[,c("a.center", "b.center")] <- scale(df[,c("a", "b")], scale=FALSE)
After which you should see this:
> sapply(df, class)
a b y a.center b.center
"numeric" "numeric" "numeric" "numeric" "numeric"
and your call to predict will succeed.
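For completeness, a quick end-to-end check using the as.vector variant with data generated as in the question (the seed is arbitrary and only there so the run can be repeated):

```r
set.seed(1)                                  # arbitrary seed
df <- data.frame(a = runif(100, 45, 90),
                 b = runif(100, 0, 60),
                 y = runif(100, -30, 60))

# scale() alone hands back a one-column matrix, not a vector
is.matrix(scale(df$a, scale = FALSE))

# Strip the matrix structure before storing the columns
df$a.center <- as.vector(scale(df$a, scale = FALSE))
df$b.center <- as.vector(scale(df$b, scale = FALSE))

m <- lm(y ~ a.center + b.center, data = df)

# Same prediction grid as in the question
predict_df <- data.frame(a.center = c(-10, 10),
                         b.center = c(-5, 5))
predict_df$predicted <- predict(m, predict_df)  # no type-mismatch error now
predict_df
```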
Upvotes: 4