Nguyen Bui

Reputation: 29

What is the purpose of using preProcess from the "caret" package in R?

Hi everyone. I saw someone use k-means clustering (kmeans) to group their data, and I don't understand why they first use preProcess to standardize the data. Here is the code:

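library(caret)
# Estimate the centering and scaling parameters from columns 3 to 12
preProc <- preProcess(UB2[3:12])
# Apply the stored transformation to the data
UBn <- predict(preProc, UB2)
set.seed(12)
UBKm <- kmeans(UBn[3:12], centers = 5, iter.max = 1000)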

Upvotes: 1

Views: 250

Answers (1)

StupidWolf

Reputation: 46888

With its default settings, preProcess centers and scales your variables (each column ends up with mean 0 and standard deviation 1), so that they are all on a comparable scale.
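As a rough check of what that means, the sketch below applies preProcess with its defaults to the numeric columns of the built-in iris data (my choice of example data, not something from the question) and compares the result with base R's scale():

```r
library(caret)

num <- iris[, 1:4]                 # example data (assumption), numeric columns only

pp <- preProcess(num)              # default method is c("center", "scale")
num_scaled <- predict(pp, num)     # apply the stored means and standard deviations

round(colMeans(num_scaled), 10)    # column means after the transformation
sapply(num_scaled, sd)             # column standard deviations after the transformation

# Compare the values with base R's scale(), ignoring attributes
same_as_scale <- isTRUE(all.equal(as.matrix(num_scaled), scale(num),
                                  check.attributes = FALSE))
same_as_scale
```

One reason to prefer the preProcess object over a direct call to scale() is that the same centering and scaling can be re-applied to new data with predict().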

When the columns have different ranges and you apply kmeans directly, the clusters are driven mainly by the columns with the largest spread, because k-means minimizes the within-cluster sum of squared Euclidean distances and those columns contribute most to it.

For example, we can simulate three clusters that are separated on variables with different scales:
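To make that concrete, here is a small sketch with made-up numbers (the matrix toy and its column names are hypothetical) that measures how much each column contributes to the total sum of squares before and after scaling:

```r
# Two columns on very different scales (made-up values)
toy <- cbind(big   = c(10, 20, 30, 40),
             small = c(0.1, 0.4, 0.2, 0.3))

# Sum of squared deviations from the column mean, per column
ss_raw    <- colSums(scale(toy, scale = FALSE)^2)
share_raw <- ss_raw / sum(ss_raw)       # "big" accounts for almost all of it

# Same calculation after centering and scaling
ss_scaled    <- colSums(scale(toy)^2)
share_scaled <- ss_scaled / sum(ss_scaled)   # both columns contribute equally

share_raw
share_scaled
```

Since k-means works on exactly these squared distances, the unscaled version effectively clusters on the big column alone.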

library(caret)
library(MASS)
library(rgl)
set.seed(111)

# Covariance matrix: variance 10 for X1 and X3, variance 1 for X2, all covariances 1
Sigma <- matrix(c(10,1,1,
                  1,1,1,
                  1,1,10),3,3)
X = rbind(mvrnorm(n=200,c(50,1,1), Sigma),
          mvrnorm(n=200,c(20,5,1), Sigma),
          mvrnorm(n=200,c(20,2.5,2.5), Sigma))
X = data.frame(X,cluster=factor(rep(1:3,each=200)))
plot3d(X[,1:3],col=as.integer(X$cluster))

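[Figure: 3D scatter plot of X1, X2, X3, colored by simulated cluster]

Note that X1 takes values in the tens (cluster means of 50 and 20), while the cluster means of X2 and X3 all lie between 1 and 5.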

If we do kmeans without scaling:

clus = kmeans(X[,1:3],3)
COLS = heat.colors(3)
plot3d(X[,1:3],col=COLS[clus$cluster])

[Figure: 3D scatter plot of X1, X2, X3, colored by k-means cluster without scaling]

k-means splits primarily on X1 and largely ignores X2 and X3, which results in the original cluster 1 being split.

So we scale and cluster:

clus = kmeans(predict(preProcess(X[,1:3]),X[,1:3]),3)
COLS = heat.colors(3)
plot3d(X[,1:3],col=COLS[clus$cluster])

[Figure: 3D scatter plot of X1, X2, X3, colored by k-means cluster after scaling]
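If you would rather not rely on the 3D plots, one option is to cross-tabulate the simulated labels against the k-means assignments. The sketch below is self-contained and makes a few assumptions of its own: a diagonal covariance matrix, a different seed, nstart = 10, and base R's scale() in place of preProcess, so the numbers will not match the plots above exactly.

```r
library(MASS)   # for mvrnorm

set.seed(1)                                  # assumption: arbitrary seed
Sigma_diag <- diag(c(10, 1, 1))              # assumption: independent columns
centres <- list(c(50, 1, 1), c(20, 5, 1), c(20, 2.5, 2.5))

# 200 points per cluster, stacked into one 600 x 3 matrix
sim <- do.call(rbind, lapply(centres, function(m) mvrnorm(200, m, Sigma_diag)))
truth <- factor(rep(1:3, each = 200))

# k-means on the raw and on the centered/scaled data
km_raw    <- kmeans(sim, centers = 3, nstart = 10)
km_scaled <- kmeans(scale(sim), centers = 3, nstart = 10)

# Rows: simulated cluster, columns: k-means cluster (labels are arbitrary)
tab_raw    <- table(truth, kmeans = km_raw$cluster)
tab_scaled <- table(truth, kmeans = km_scaled$cluster)
tab_raw
tab_scaled
```

A row whose counts are spread over several columns means that simulated cluster was split by k-means. The k-means labels themselves are arbitrary, so compare the pattern of the table rather than the diagonal.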

Upvotes: 2
