Aveshen Pillay

Reputation: 481

Preprocessing of training and test data using caret

Good day

I am using the preProcess() function from the caret package to center and scale my training data. I also have a test data set which I want to scale with the same mean and standard deviation as the training set. In this way, I am treating the test data as completely new / unseen data, scaled according to what I observed in the training data.

I have the following code:

# train: training data
# test:  test data (to be treated as unseen)

preprocess_values_train = preProcess(train, method = c("center", "scale"))
train.st = predict(preprocess_values_train, train)

test.st = predict(preprocess_values_train, test)

I thought that this would apply the training mean and standard deviation to the test data set, but it doesn’t. How would you edit this code to scale the test data based on the training data details? train.st is exactly what I need, but test.st is not.

Thanks, Aveshen

Upvotes: 1

Views: 1592

Answers (1)

StupidWolf

Reputation: 46978

predict() does scale the data using the mean and sd of your training set. A reproducible example:

library(caret)
df = data.frame(matrix(runif(2000),ncol=10))
train = df[1:100,]
test = df[101:200,]

preprocess_values_train = preProcess(train, method = c("center", "scale"))
train.st = predict(preprocess_values_train, train)

head(train.st)
          X1          X2         X3           X4         X5           X6
1  1.3163365 -0.31011484 -1.2534994  1.448256135 -0.8130691  1.401194346
2  1.1156438  1.44669749 -1.3775943 -0.077657870  1.6383685 -0.004940122
3  0.3628558  0.05983967 -1.4853910 -0.233465895  0.7657059  1.173381343
4 -1.3851982 -0.78838468  1.3607501 -0.001212484 -0.3388031 -1.321384412
5 -1.0269737 -1.34665949 -1.2681398  1.507292935  0.4152667  1.337453028
6  0.6322652  0.31820145  0.3719918  1.619318256 -0.3721707 -0.955420716
          X7          X8         X9        X10
1  0.5323608  0.09905265 -0.4302925 -1.3965973
2  0.8590394 -1.13310729  0.9641076  0.9685195
3 -0.7753370 -0.08805592  1.4285071 -1.2162778
4  1.1605200  0.44107850 -0.7273844  0.7803693
5  0.2324899  0.28557215 -0.2934569  1.5633815
6 -0.7492416 -0.18478112  1.1474105 -0.2717625

We can calculate it manually:

scaled_train = t(apply(train,1,function(i)(i-preprocess_values_train$mean)/preprocess_values_train$std))

As you can see, we get back the same values as predict(...):

all.equal(scaled_train,as.matrix(train.st))
[1] TRUE
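
The row-wise apply() above can also be written column-wise with sweep(). The following is a minimal base R sketch, not part of caret: the seed and the names col_mean, col_sd, row_wise and swept are assumptions made for illustration, with col_mean and col_sd standing in for preprocess_values_train$mean and preprocess_values_train$std.

```r
set.seed(42)  # assumed seed, for reproducibility only
train <- data.frame(matrix(runif(1000), ncol = 10))

# Per-column statistics (hypothetical names standing in for
# preprocess_values_train$mean and preprocess_values_train$std)
col_mean <- colMeans(train)
col_sd   <- apply(train, 2, sd)

# Row-wise version, as in the manual calculation above
row_wise <- t(apply(train, 1, function(i) (i - col_mean) / col_sd))

# Column-wise version: subtract each column's mean, then divide by its sd
swept <- sweep(sweep(as.matrix(train), 2, col_mean, "-"), 2, col_sd, "/")

# Compare the values only, ignoring dimnames
all.equal(unname(row_wise), unname(swept))
```

Both forms compute the same thing; sweep() just avoids the transpose and makes it explicit that the statistics are applied per column.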

Now we apply this to test:

test.st = predict(preprocess_values_train, test)

scaled_test = t(apply(test,1,function(i)(i-preprocess_values_train$mean)/preprocess_values_train$std))

all.equal(scaled_test,as.matrix(test.st))
[1] TRUE
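
One possible reason test.st looks wrong is that its columns do not have a mean of exactly 0 and an sd of exactly 1. That is expected, because the centering and scaling values come from train, not from test. The sketch below illustrates this with base R's scale() rather than caret; the seed and the names train_mean, train_sd, train_scaled and test_scaled are assumptions chosen for the example.

```r
set.seed(1)  # assumed seed, for reproducibility only
df <- data.frame(matrix(runif(2000), ncol = 10))
train <- df[1:100, ]
test  <- df[101:200, ]

# Statistics estimated on the training set only
train_mean <- colMeans(train)
train_sd   <- apply(train, 2, sd)

# Apply the training statistics to both sets
train_scaled <- scale(train, center = train_mean, scale = train_sd)
test_scaled  <- scale(test,  center = train_mean, scale = train_sd)

# Training columns: mean 0 and sd 1 up to floating point error
round(colMeans(train_scaled), 10)
round(apply(train_scaled, 2, sd), 10)

# Test columns: close to 0 and 1, but not exactly
round(colMeans(test_scaled), 3)
round(apply(test_scaled, 2, sd), 3)
```

If the test columns came out with mean exactly 0 and sd exactly 1, that would mean the test set had been scaled with its own statistics, which is what the question is trying to avoid.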

Upvotes: 1
