Reputation: 21
I am using the kmeans() function (from base R's stats package) together with the factoextra package in R. Everything is working great - I have the clusters for my data, but I am looking for a way to pull out only the datapoints that fall within a certain distance from the centroid of each cluster. Ideally, I'm looking for a way to subset, say, the closest 80% of points in each cluster and ignore the points that lie farther from the centroid.
Is there a simple way to achieve this within the kmeans() function or the factoextra package? Or will this require a more complicated approach?
I haven't found a parameter that seems to do this job. Perhaps I'm missing something right under my nose? So far, I am using this simple code:
library(factoextra)
km.res<-kmeans(data, 4, nstart=10)
fviz_cluster(km.res, data, ellipse.type="norm", geom="point")
Thanks!
Upvotes: 1
Views: 60
Reputation: 19088
With kmeans, after picking the number of clusters ncl, the centers are in km$centers. The closest 80% of points can be taken by dividing the total number of entries by the number of clusters.
EDIT: if you have different cluster sizes, we need to calculate the number of points per cluster.
Plot
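perc <- 0.8
plot(df_var, pch=19, lwd=2)
# invisible() suppresses the list of NULLs that lapply() would otherwise print
invisible(lapply(seq_len(nrow(km$centers)), \(clst){
  # only consider points assigned to this cluster
  pts <- df_var[km$cluster == clst, ]
  # number of points to keep in this cluster
  pck <- floor(nrow(pts) * perc)
  # Euclidean distance of each point to the cluster center
  d <- sqrt((km$centers[clst,"x"] - pts[,"x"])^2 +
            (km$centers[clst,"y"] - pts[,"y"])^2)
  points(head(pts[order(d), ], pck), col="green2", pch=20)}))
points(km$centers, col="red", pch=19, lwd=3)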
Data (run this before the plot code)
set.seed(1)
df_var <- data.frame(x = c(rnorm(100, 4, 1), rnorm(40, 9, 1)),
y = c(rnorm(100, 4, 1), rnorm(40, 9, 1)))
ncl <- 2
km <- kmeans(df_var, ncl)
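If you want the rows themselves rather than a plot, the same idea can be wrapped in a small function. This is only a minimal sketch: closest_fraction is a hypothetical helper name (it is not part of stats or factoextra), and it assumes all columns are numeric and that Euclidean distance to the assigned center is what you want.

```r
# Hypothetical helper (not from any package): keep the closest `perc`
# fraction of points in each k-means cluster and return them as a data frame.
closest_fraction <- function(dat, km, perc = 0.8) {
  m <- as.matrix(dat)
  # Euclidean distance from every point to the center of its own cluster
  d <- sqrt(rowSums((m - km$centers[km$cluster, , drop = FALSE])^2))
  # Rank of each point by distance within its cluster (1 = closest)
  rnk <- ave(d, km$cluster, FUN = function(x) rank(x, ties.method = "first"))
  # Number of points to keep in each point's cluster
  n_keep <- ave(d, km$cluster, FUN = function(x) floor(length(x) * perc))
  keep <- rnk <= n_keep
  data.frame(dat[keep, , drop = FALSE],
             cluster = km$cluster[keep],
             distance = d[keep])
}

# Same simulated data as in this answer
set.seed(1)
df_var <- data.frame(x = c(rnorm(100, 4, 1), rnorm(40, 9, 1)),
                     y = c(rnorm(100, 4, 1), rnorm(40, 9, 1)))
km <- kmeans(df_var, 2)

closest <- closest_fraction(df_var, km, perc = 0.8)
table(closest$cluster)  # number of points kept per cluster
```

Because the ranking is done per cluster with ave(), the result keeps the original row order and works with unequal cluster sizes and with more than two columns.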
Upvotes: 1
Reputation: 495
Yes, you can achieve this. The subsetting itself is done in base R, and factoextra is only needed for the plot. It is not complex, and looking at the distances to the centroids is a useful check whenever you inspect k-means results.
km.res <- kmeans(data, 4, nstart = 10)
# Calculate each point's Euclidean distance to its own cluster centroid
distances <- sqrt(rowSums((data - km.res$centers[km.res$cluster, ])^2))
# Combine distances with the data (assumes data is a data frame)
data_with_distances <- cbind(data, cluster = km.res$cluster, distance = distances)
# Per-cluster 80th percentile of distance, returned in the original row order
cutoff <- ave(data_with_distances$distance, data_with_distances$cluster, FUN = function(x) quantile(x, 0.80))
# Subset data within closest 80% for each cluster
subset_data <- data_with_distances[data_with_distances$distance <= cutoff, ]
Visualisation
As you can see, there are four clusters. Only the 80% of data points closest to their respective centroids are plotted.
fviz_cluster(list(data = subset_data[, -c(ncol(subset_data)-1, ncol(subset_data))],
cluster = subset_data$cluster),
geom = "point", ellipse.type = "norm")
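The question also mentions keeping points "within a certain distance" of the centroid. If a fixed radius is preferred over a percentage, a rough sketch could look like the following. The simulated data and the cut-off max_dist = 1.5 are assumptions made purely for illustration; pick a value that makes sense in the units of your own data.

```r
# Simulated two-cluster data, used here only for illustration
set.seed(1)
df_var <- data.frame(x = c(rnorm(100, 4, 1), rnorm(40, 9, 1)),
                     y = c(rnorm(100, 4, 1), rnorm(40, 9, 1)))
km <- kmeans(df_var, 2, nstart = 10)

# Euclidean distance of each point to the centroid of its own cluster
d <- sqrt(rowSums((as.matrix(df_var) - km$centers[km$cluster, ])^2))

# Assumed cut-off radius, in the units of the data
max_dist <- 1.5
inside <- d <= max_dist

# Keep only the points inside the radius, with their cluster labels
within_radius <- df_var[inside, ]
within_radius$cluster <- km$cluster[inside]

# Share of each cluster retained by the fixed cut-off
tapply(inside, km$cluster, mean)
```

Unlike the quantile approach, a fixed radius keeps a different share of each cluster, so tight clusters lose fewer points than spread-out ones.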
Upvotes: 1