Reputation: 1056
I'm working with kmeans in R, and using this line of code to find the centers of my data:
res=kmeans(data,centers=5)
I can access the centers with this code:
res$centers
My first question is: are the centers members of my data, or are they just the 5 computed centers of the data?
If the centers are data points, how can I get their indices?
If the centers are not data points, how can I find the closest data point to each center?
Thank you
Algorithm url here
Upvotes: 3
Views: 2377
Reputation: 37879
No, the centroids are generally not members of your data. When centers is a number, kmeans picks random rows of the data as starting centers, but each final centroid is the mean of the points assigned to its cluster. A centroid may happen to fall on a data point, but that is a coincidence, and the centroid is still a separate point.
The kmeans function does not return the data point closest to each center, but it is easy to compute on your own. See the following example:
library(stats)
x <- matrix(runif(3000),ncol=3 ) #create a 3-column matrix
mymod <- kmeans(x=x, centers=3) #run the kmeans model
x <- cbind(x,1:nrow(x)) #add index id (the row number) so that we can find the nearest data point later
#find nearest data point for the 1st cluster for this example
cluster1 <- data.frame(x[mymod$cluster==1,]) #convert to data.frame to work with dplyr
library(dplyr)
#calculate the euclidean distance between each data point in cluster 1 and the centroid 1
#store in column dist
cluster1 <- cluster1 %>% mutate(dist=sqrt( (cluster1[,1] - mymod$centers[1,1])^2 +
(cluster1[,2] - mymod$centers[1,2])^2 +
(cluster1[,3] - mymod$centers[1,3])^2 )
)
#nearest point to cluster 1
> cluster1[which.min(cluster1$dist), ]
X1 X2 X3 X4 dist
86 0.3801898 0.2592491 0.6675403 280 0.04266474
As shown above, the closest data point to center 1 is row 280 of the matrix x (the X4 column holds the original row index).
You can do exactly the same for each center. If you have many centers, just write a function and call it with lapply.
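As a rough sketch of that idea, the helper below repeats the distance calculation for every center and returns the original row index of the nearest cluster member. The name nearest_to_center and the seed are assumptions made for illustration only. Like the example above, it searches only among the points assigned to each cluster.

```r
# Sketch: index of the data point nearest to each k-means center.
set.seed(42)                            # assumption: seed only for reproducibility
x <- matrix(runif(3000), ncol = 3)      # same shape of data as the example above
mymod <- kmeans(x, centers = 3)

# nearest_to_center is a hypothetical helper name, not part of the stats package
nearest_to_center <- function(k, data, model) {
  members <- which(model$cluster == k)  # row indices of the points in cluster k
  # subtract center k from every member row, column by column
  diffs <- sweep(data[members, , drop = FALSE], 2, model$centers[k, ])
  dists <- sqrt(rowSums(diffs^2))       # Euclidean distance to center k
  members[which.min(dists)]             # original row index in data
}

# one row index per center, in the order of mymod$centers
nearest_idx <- unlist(lapply(seq_len(nrow(mymod$centers)), nearest_to_center,
                             data = x, model = mymod))
nearest_idx        # row indices of the nearest data points
x[nearest_idx, ]   # the data points themselves
```

Because the helper works on row indices directly, there is no need to bind an id column to x or to load dplyr.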
Hope that helps!
P.S. The formula used to calculate the Euclidean distance is here
Upvotes: 4