bgenomics

Reputation: 21

Subset the points closest to the centroid of each cluster using factoextra in R

I am using the kmeans() function together with the factoextra package in R. Everything is working great: I have the clusters for my data, but I am looking for a way to pull out only the data points that fall within a certain distance of the centroid of each cluster. Ideally, I'd like to subset, say, the closest 80% of points and ignore the points that lie farther from the centroid.

Is there a simple way to achieve this within the kmeans() function or the factoextra package? Or will this require a more complicated approach?

I haven't found a parameter that seems to do this job. Perhaps I'm missing something right under my nose? So far, I am using this simple code:

library(factoextra)

km.res<-kmeans(data, 4, nstart=10)
fviz_cluster(km.res, data, ellipse.type="norm", geom="point")

Thanks!

Upvotes: 1

Views: 60

Answers (2)

Andre Wildberg

Reputation: 19088

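With kmeans, after picking the number of clusters ncl, the cluster centers are in km$centers. The closest 80% of points can then be taken per cluster. The first version did this by dividing the total number of entries by the number of clusters, which assumes equally sized clusters.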

EDIT: if the clusters have different sizes, the number of points to keep has to be calculated per cluster.

Plot

perc <- 0.8

plot(df_var, pch=19, lwd=2)

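invisible(lapply(seq_len(nrow(km$centers)), \(clst){
  # members of this cluster only, so points from other clusters cannot be picked
  memb <- df_var[km$cluster == clst, ]
  # number of points to keep: perc of this cluster's size, rounded down
  pck <- floor(nrow(memb) * perc)
  # Euclidean distance of each member to its cluster center
  dst <- sqrt((km$centers[clst,"x"] - memb[,"x"])^2 +
              (km$centers[clst,"y"] - memb[,"y"])^2)
  points(head(memb[order(dst), ], pck), col="green2", pch=20)}))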

points(km$centers, col="red", pch=19, lwd=3)

[Image: kmeans cluster scatter plot]

Data

set.seed(1)

df_var <- data.frame(x = c(rnorm(100, 4, 1), rnorm(40, 9, 1)),
                     y = c(rnorm(100, 4, 1), rnorm(40, 9, 1)))

ncl <- 2

km <- kmeans(df_var, ncl)
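
If you need the subset itself rather than only the plot, the same idea can be written so that it returns a data frame. This is a minimal sketch, not the only way to do it: it rebuilds df_var and km as above, measures each point's distance to the center of its own cluster, and keeps the floor(n * perc) closest members of every cluster. The names d, keep and closest are made up for illustration.

```r
set.seed(1)
df_var <- data.frame(x = c(rnorm(100, 4, 1), rnorm(40, 9, 1)),
                     y = c(rnorm(100, 4, 1), rnorm(40, 9, 1)))
km <- kmeans(df_var, 2)
perc <- 0.8

# Euclidean distance from every point to the center of its own cluster
d <- sqrt(rowSums((as.matrix(df_var) - km$centers[km$cluster, ])^2))

# Within each cluster, rank members by distance and flag the closest share.
# ave() returns its result in the original row order (logical coerced to 0/1).
keep <- ave(d, km$cluster, FUN = function(v) {
  rank(v, ties.method = "first") <= floor(length(v) * perc)
}) == 1

closest <- df_var[keep, ]      # the subset as a data frame
table(km$cluster[keep])        # number of points kept per cluster
```

Because keep is a logical vector in row order, it can also be used to colour the kept points in the plot, for example with points(df_var[keep, ], col = "green2", pch = 20).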

Upvotes: 1

Qian

Reputation: 495

Yes, you can achieve this with a few lines of base R, using factoextra for the plot. It is not complex, and it is worth doing whenever you inspect the results of k-means.

km.res <- kmeans(data, 4, nstart = 10)

# Calculate distances 
distances <- sqrt(rowSums((data - km.res$centers[km.res$cluster, ])^2))

# Combine distances with the data
data_with_distances <- cbind(data, cluster = km.res$cluster, distance = distances)

# Subset data within closest 80% for each cluster
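# ave() keeps the original row order, so the threshold lines up with each row
threshold <- ave(data_with_distances$distance, data_with_distances$cluster, FUN = function(x) quantile(x, 0.80))
subset_data <- data_with_distances[data_with_distances$distance <= threshold, ]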

Visualisation

As the plot shows, there are four clusters. The 80% of data points closest to their respective centroids are highlighted and circled.


fviz_cluster(list(data = subset_data[, -c(ncol(subset_data)-1, ncol(subset_data))], 
                  cluster = subset_data$cluster), 
             geom = "point", ellipse.type = "norm")
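
The question also asks about keeping points within a fixed distance of the centroid rather than a fixed share. A minimal sketch of that variant follows, assuming data is a data frame of numeric columns. The simulated data and the cutoff max_dist = 1.5 are arbitrary illustration values, not recommendations, and within_cutoff is a made-up name.

```r
set.seed(42)
# Simulated stand-in for `data`: two well-separated groups
data <- data.frame(x = c(rnorm(50, 0), rnorm(50, 6)),
                   y = c(rnorm(50, 0), rnorm(50, 6)))
km.res <- kmeans(data, 2, nstart = 10)

# Distance from each point to the center of its own cluster
distances <- sqrt(rowSums((as.matrix(data) - km.res$centers[km.res$cluster, ])^2))

max_dist <- 1.5  # hypothetical cutoff, in the units of the data
inside <- distances <= max_dist

within_cutoff <- cbind(data[inside, ],
                       cluster = km.res$cluster[inside],
                       distance = distances[inside])

# Share of each cluster that survives the cutoff
tapply(inside, km.res$cluster, mean)
```

Unlike the quantile approach, a fixed cutoff keeps a different share of each cluster, so tight clusters lose fewer points than diffuse ones.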

Upvotes: 1
