Reputation: 481
Good day
I am using the preProcess() function from the caret package to center and scale my training data. I also have a test data set that I want to scale with the same mean and standard deviation as the training set. In this way, I treat the test data as completely new, unseen data, scaled according to what I observed in the training data.
I have the following code:
# train: training data
# test: test data (to be treated as unseen)
preprocess_values_train = preProcess(train, method = c("center", "scale"))
train.st = predict(preprocess_values_train, train)
test.st = predict(preprocess_values_train, test)
I thought that this would apply the training mean and standard deviation to the test data set, but it does not seem to. How would you edit this code to scale the test data using the mean and standard deviation of the training data? train.st is exactly what I need, but test.st is not.
Thanks, Aveshen
Upvotes: 1
Views: 1592
Reputation: 46978
The test set is scaled using the mean and standard deviation of your training set. Here is an example:
library(caret)
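# Simulated data: 200 rows, 10 columns of uniform random numbers
df = data.frame(matrix(runif(2000), ncol = 10))
# First 100 rows for training, remaining 100 rows for testing
train = df[1:100, ]
test = df[101:200, ]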
preprocess_values_train = preProcess(train, method = c("center", "scale"))
train.st = predict(preprocess_values_train, train)
head(train.st)
X1 X2 X3 X4 X5 X6
1 1.3163365 -0.31011484 -1.2534994 1.448256135 -0.8130691 1.401194346
2 1.1156438 1.44669749 -1.3775943 -0.077657870 1.6383685 -0.004940122
3 0.3628558 0.05983967 -1.4853910 -0.233465895 0.7657059 1.173381343
4 -1.3851982 -0.78838468 1.3607501 -0.001212484 -0.3388031 -1.321384412
5 -1.0269737 -1.34665949 -1.2681398 1.507292935 0.4152667 1.337453028
6 0.6322652 0.31820145 0.3719918 1.619318256 -0.3721707 -0.955420716
X7 X8 X9 X10
1 0.5323608 0.09905265 -0.4302925 -1.3965973
2 0.8590394 -1.13310729 0.9641076 0.9685195
3 -0.7753370 -0.08805592 1.4285071 -1.2162778
4 1.1605200 0.44107850 -0.7273844 0.7803693
5 0.2324899 0.28557215 -0.2934569 1.5633815
6 -0.7492416 -0.18478112 1.1474105 -0.2717625
We can reproduce this manually, using the mean and standard deviation stored in the preProcess object:
scaled_train = t(apply(train, 1, function(i) (i - preprocess_values_train$mean) / preprocess_values_train$std))
As you can see, this gives the same values as predict(...):
all.equal(scaled_train,as.matrix(train.st))
[1] TRUE
Now we apply the same training mean and standard deviation to test:
test.st = predict(preprocess_values_train, test)
scaled_test = t(apply(test, 1, function(i) (i - preprocess_values_train$mean) / preprocess_values_train$std))
all.equal(scaled_test,as.matrix(test.st))
[1] TRUE
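One possible reason test.st looks wrong is that its columns will generally not have a mean of exactly 0 and a standard deviation of exactly 1. That is expected, because the centering and scaling values come from the training set, not from the test set. The following is a minimal base-R sketch of that effect. It does not use caret; the seed and the names train_mean, train_sd, train_scaled and test_scaled are chosen here purely for illustration.

```r
# Base-R sketch: scale train and test with the TRAINING statistics only.
set.seed(1)  # arbitrary seed, only so this sketch is reproducible
df <- data.frame(matrix(runif(2000), ncol = 10))
train <- df[1:100, ]
test  <- df[101:200, ]

# Column-wise mean and standard deviation of the training set
train_mean <- colMeans(train)
train_sd   <- sapply(train, sd)

# scale() subtracts `center` and divides by `scale`, column by column
train_scaled <- scale(train, center = train_mean, scale = train_sd)
test_scaled  <- scale(test,  center = train_mean, scale = train_sd)

# Training columns are standardized up to floating-point error
print(range(colMeans(train_scaled)))
print(range(apply(train_scaled, 2, sd)))

# Test columns are only close to mean 0 and sd 1, not exactly there
print(colMeans(test_scaled))
print(apply(test_scaled, 2, sd))
```

If the test columns come out near 0 and 1 but not exactly, the training parameters have been applied as intended.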
Upvotes: 1