Anna P.

Reputation: 213

K-means clustering using R - is the resulting plot off?

I am trying to do k-means clustering using R, and this is what I have done so far:

tmp <- kmeans(ds, centers = 4, iter.max = 1000) 

plot(ds[tmp$cluster == 1, c(1, 5)], col = "red",
     xlim = range(ds[, 1]), ylim = range(ds[, 5]))
points(ds[tmp$cluster == 2, c(1, 5)], col = "blue")
points(ds[tmp$cluster == 3, c(1, 5)], col = "seagreen")
points(ds[tmp$cluster == 4, c(1, 5)], col = "orange")
points(tmp$centers[, c(1, 5)], col = "black")

and I get the following graph:

[Image: scatter plot of the four clusters produced by the code above]

I am quite new to this, so I may be way off, but this graph does not look right to me. The data is basically divided into zones, and to be honest I was expecting to see something along the lines of this:

[Image: example of the kind of clustering I expected]

The dataset I am using can be found here.

Upvotes: 2

Views: 423

Answers (2)

Samuel

Reputation: 3053

This is how the k-means clustering algorithm works. Google "k-means clustering" and look at the image results, and you will see different variations: circular clusters as well as the kind you got. If you set the number of clusters k to a different value, you will get different clusters. The goal of the algorithm is to partition a data set into a desired number k of non-overlapping clusters so that the total within-cluster variation is minimized. That is the result you see in your plot.
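To make "total within-cluster variation" concrete, here is a minimal sketch on synthetic data (the matrix `x` is an assumption for illustration, not the asker's dataset). It recomputes the objective by hand and compares it with the `tot.withinss` value that `kmeans()` returns, then records the objective for several values of k.

```r
# Synthetic stand-in data: two well-separated blobs (an assumption,
# not the dataset from the question).
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

fit <- kmeans(x, centers = 2, nstart = 10)

# Recompute the objective by hand: squared Euclidean distance from each
# point to the center of its assigned cluster, summed over all points.
manual_wss <- sum((x - fit$centers[fit$cluster, ])^2)

# kmeans() reports the same quantity as tot.withinss.
print(all.equal(manual_wss, fit$tot.withinss))

# The objective for several choices of k; a different k gives a
# different partition and a different total within-cluster variation.
wss_by_k <- sapply(1:5, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})
print(wss_by_k)
```

Using `nstart = 10` reruns the algorithm from several random starts and keeps the best result, which makes the comparison across k less sensitive to an unlucky initialization.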

Upvotes: 2

G5W

Reputation: 37631

Notice that Age runs from about 18 to 60, so the maximum distance between two ages is about 40. Now notice that the incomes range from 0 to 20000. Because k-means works with Euclidean distance on the raw values, the distance between points is heavily dominated by income. If you want both variables to contribute to the clustering, you should scale the data before clustering. Try:

tmp <- kmeans(scale(ds), centers = 4, iter.max = 1000)
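The effect can be reproduced on a small synthetic example. This is only a sketch: the data frame `toy`, its two columns, the two age groups, and the helper `agreement()` are all made up for illustration and are not the asker's data. Age carries the real group structure, while Income is noise on a much larger scale.

```r
# Hypothetical toy data: two age groups, income is unrelated noise
# on a scale of 0 to 20000 (all assumptions for illustration).
set.seed(1)
n <- 200
group <- rep(1:2, each = n / 2)
toy <- data.frame(
  Age    = c(rnorm(n / 2, mean = 25, sd = 3), rnorm(n / 2, mean = 55, sd = 3)),
  Income = runif(n, min = 0, max = 20000)
)

raw_fit    <- kmeans(toy, centers = 2, nstart = 10)
scaled_fit <- kmeans(scale(toy), centers = 2, nstart = 10)

# Fraction of points whose cluster matches the true age group,
# allowing for the two cluster labels being swapped.
agreement <- function(cl) max(mean(cl == group), mean(cl != group))

# Unscaled: the split follows Income, so it ignores the age groups.
print(table(group, raw = raw_fit$cluster))
print(agreement(raw_fit$cluster))

# Scaled: both columns have unit variance, so the age groups are found.
print(table(group, scaled = scaled_fit$cluster))
print(agreement(scaled_fit$cluster))
```

Note that the centers returned after `scale()` are in standardized units, so to draw the plot on the original axes, color the original data by the cluster labels from the scaled fit rather than plotting the centers directly.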

Upvotes: 2
