seleucia

Reputation: 1056

Finding centers' index using kmeans in R

I'm working with kmeans in R and using this line of code to find the centers of my data:

res=kmeans(data,centers=5)

I can reach my centers with this code:

res$centers

My first question: are these centers members of my data set, or are they simply 5 computed centers of the data?

If the centers are data points, how can I get their indices?

If the centers are not data points, how can I find the closest data point to each center?

Thank you

Algorithm url here

Upvotes: 3

Views: 2377

Answers (1)

LyzandeR

Reputation: 37879

  1. First Question (are the centers part of my data?):

No, the centroids are generally not members of your data. kmeans starts from randomly chosen rows of the data, but each center it returns is the mean of the points assigned to that cluster. A centroid might happen to fall on a data point, but that would be a coincidence, and the centroid is still a separate point.
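As a quick check of how the returned centers relate to the data, the sketch below recomputes each center as the mean of its cluster's points and compares it with res$centers-style output. The toy data, the seed, and the variable names (fit, manual, centers_are_means, center_is_data_point) are assumptions made for illustration only.

```r
set.seed(42)                        # arbitrary seed, only for reproducibility
x <- matrix(runif(300), ncol = 3)   # toy data: 100 points in 3 dimensions
fit <- kmeans(x, centers = 3)

# Recompute each center as the column means of the points assigned to it
manual <- t(sapply(1:3, function(k) {
  colMeans(x[fit$cluster == k, , drop = FALSE])
}))

# TRUE if the returned centers equal the per-cluster means
centers_are_means <- isTRUE(all.equal(unname(fit$centers), unname(manual)))

# TRUE if any center coincides exactly with a row of x
center_is_data_point <- any(apply(fit$centers, 1, function(ctr) {
  any(colSums(t(x) == ctr) == ncol(x))
}))
```

On continuous data like this, centers_are_means should come out TRUE, while an exact match between a center and a data row would be unusual.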

  2. Second Question (How can I find the closest data point to my center?)

The kmeans function does not return this directly, but it is easy to compute yourself. See the following example:

library(stats)
x <- matrix(runif(3000),ncol=3 ) #create a 3-column matrix
mymod <- kmeans(x=x, centers=3)  #run the kmeans model

x <- cbind(x,1:nrow(x)) #add index id (the row number) so that we can find the nearest data point later

#find nearest data point for the 1st cluster for this example
cluster1  <- data.frame(x[mymod$cluster==1,]) #convert to data.frame to work with dplyr


library(dplyr)

#calculate the euclidean distance between each data point in cluster 1 and the centroid 1
#store in column dist
cluster1 <- cluster1 %>% mutate(dist=sqrt(  (cluster1[,1] - mymod$centers[1,1])^2 +
                                            (cluster1[,2] - mymod$centers[1,2])^2 +
                                            (cluster1[,3] - mymod$centers[1,3])^2 ) 
                    )


#nearest point to cluster 1
> cluster1[which.min(cluster1$dist), ]
          X1        X2        X3  X4       dist
86 0.3801898 0.2592491 0.6675403 280 0.04266474

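As shown above, the closest data point to center 1 is row 280 of the matrix x. The X4 column holds the original row index; the 86 on the left is only the row's position within cluster1.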

You can do exactly the same for each center. If you have many centers, just write a function and call it with lapply.
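A minimal sketch of that idea follows, using base R only. The helper name nearest_to_centers is hypothetical (it is not part of stats or dplyr), and it uses sapply, a variant of lapply that simplifies the result to a vector. Like the example above, it searches only among each center's own cluster members.

```r
# Hypothetical helper: for each center, return the row index in x of the
# nearest point among the members of that center's cluster.
nearest_to_centers <- function(x, fit) {
  sapply(seq_len(nrow(fit$centers)), function(k) {
    idx <- which(fit$cluster == k)                  # original row numbers in cluster k
    # Subtract center k from every member row (sweep defaults to "-")
    diffs <- sweep(x[idx, , drop = FALSE], 2, fit$centers[k, ])
    d <- sqrt(rowSums(diffs^2))                     # Euclidean distance to center k
    idx[which.min(d)]                               # map back to a row number of x
  })
}

set.seed(1)                                         # arbitrary seed for reproducibility
x <- matrix(runif(3000), ncol = 3)
mymod <- kmeans(x = x, centers = 3)

nearest_idx <- nearest_to_centers(x, mymod)         # one row index per center
nearest_points <- x[nearest_idx, ]                  # the corresponding data points
```

Because the helper works on row numbers directly, there is no need to bind an index column onto x first.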

Hope that helps!

P.S. Formula used to calculate euclidean distance is here
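For reference, the distance computed in the mutate call above is the ordinary Euclidean distance in three dimensions. Writing p for a data point and c for a centroid (both symbols are notation chosen here, not names from the code):

```latex
d(p, c) = \sqrt{(p_1 - c_1)^2 + (p_2 - c_2)^2 + (p_3 - c_3)^2}
```

Each squared term corresponds to one of the three columns of x.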

Upvotes: 4
