Reputation: 29
"Hi everyone. I saw an example that uses k-means to cluster the groups, but I don't understand why they first use preProcess to standardize the data. Here is the code:"
preProc <- preProcess(UB2[3:12])   # default method: center and scale
UBn <- predict(preProc, UB2)       # apply the transformation to the data
set.seed(12)
UBKm <- kmeans(UBn[3:12], centers = 5, iter.max = 1000)
Upvotes: 1
Views: 250
Reputation: 46888
You use preProcess to center and scale your variables, basically to put them all on a comparable range.
If the columns have very different ranges and you apply kmeans directly, the clusters will mostly minimize the within-cluster variance of the columns with the largest values, because those columns dominate the Euclidean distances kmeans works with.
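This effect can be sketched with base R's scale(), which does the same centering and scaling that preProcess applies by default (the data frame and the column names `big` and `small` below are made up for illustration):

```r
set.seed(1)
# Two columns on very different scales: `big` spans roughly 0-100,
# `small` spans roughly 0-1 (both names are hypothetical).
x <- data.frame(big   = c(rnorm(25, 0), rnorm(25, 100)),
                small = c(rnorm(25, 0), rnorm(25, 1)))

# Without scaling, the Euclidean distances are dominated by `big`.
unscaled <- kmeans(x, centers = 2, nstart = 10)

# scale() centers each column to mean 0 and sd 1, mirroring
# preProcess' default of centering and scaling.
scaled <- kmeans(scale(x), centers = 2, nstart = 10)
```

After scaling, both columns contribute equally to the distances, so differences in `small` are no longer swamped by `big`.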
For example, we can simulate three clusters that are separable on variables of different scales:
library(caret)
library(MASS)
library(rgl)
set.seed(111)
Sigma <- matrix(c(10,1,1, 1,1,1, 1,1,1), 3, 3)  # 3x3 covariance matrix
X = rbind(mvrnorm(n = 200, c(50, 1, 1), Sigma),
          mvrnorm(n = 200, c(20, 5, 1), Sigma),
          mvrnorm(n = 200, c(20, 2.5, 2.5), Sigma))
X = data.frame(X, cluster = factor(rep(1:3, each = 200)))
plot3d(X[,1:3], col = factor(rep(1:3, each = 200)))
Note that X1 is in the range 0-60 while X2 and X3 are roughly between -1 and 10.
If we do kmeans without scaling:
clus = kmeans(X[,1:3],3)
COLS = heat.colors(3)
plot3d(X[,1:3],col=COLS[clus$cluster])
It splits primarily along X1 and largely ignores X2 and X3, which results in the original cluster 1 being split in two.
So instead we scale the data first and then cluster:
clus = kmeans(predict(preProcess(X[,1:3]),X[,1:3]),3)
COLS = heat.colors(3)
plot3d(X[,1:3],col=COLS[clus$cluster])
Upvotes: 2