The following code helps determine the optimal number of clusters:
library(purrr)  # for map_dbl

set.seed(123)

# function to compute total within-cluster sum of squares
wss <- function(k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}

# compute and plot wss for k = 1 to k = 15
k.values <- 1:15

# extract wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)

plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
Reference: https://uc-r.github.io/
The goal is to convert this to run in shared memory with multiple cores so that it finishes faster. I tried fviz_nbclust and it is extremely slow.
Approach/Attempt:
First, create a wss function to be called from mclapply:
parallel.wss <- function(i, k) {
  set.seed(101)
  kmeans(df, k, nstart = i)$tot.withinss
}
Here i is the number of random starts and k is taken from k.values, the numbers of clusters to try in order to find the optimum.
k.values <- 1:15
kmean_results <- mclapply(c(25,25,25,25), k.values, FUN=parallel.wss)
but I got the following warning:
Warning message:
In mclapply(c(25, 25, 25, 25), k.values, FUN = parallel.wss) :
all scheduled cores encountered errors in user code
Looking at the kmean_results object:
head(kmean_results)
[[1]]
[1] "Error in kmeans(df, k, nstart = i) : \n  must have same number of columns in 'x' and 'centers'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
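For context, the error arises because mclapply(X, FUN, ...) iterates only over its first argument: here it loops over c(25, 25, 25, 25) and passes the entire k.values vector as k, so kmeans receives a 15-element vector as centers and treats it as a one-column matrix of initial centres, which does not match the columns of df. A hedged sketch that iterates over k.values instead (assuming the parallel package and a df defined as above; mc.cores should suit the machine):

```r
library(parallel)

# iterate over cluster counts; the number of random starts stays fixed
parallel.wss <- function(k, nstart) {
  set.seed(101)
  kmeans(df, k, nstart = nstart)$tot.withinss
}

# mclapply loops over k.values; nstart is passed whole to every call
kmean_results <- mclapply(k.values, parallel.wss, nstart = 25, mc.cores = 4)
wss_values <- unlist(kmean_results)
```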
With foreach, you can do:
ncores <- parallel::detectCores(logical = FALSE)
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)

library(foreach)
wss_values2 <- foreach(k = k.values, .combine = 'c') %dopar% {
  kmeans(df, k, nstart = 10)$tot.withinss
}

parallel::stopCluster(cl)
If you wrap the kmeans call in a function, you need to pass all the variables it uses as arguments (df and k).
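As a sketch of that point (the function name wss_par is illustrative, not from the answer): foreach only auto-exports variables referenced directly in the loop body, so a helper function must receive its data explicitly.

```r
# pass everything the worker needs as arguments
wss_par <- function(k, df) {
  kmeans(df, k, nstart = 10)$tot.withinss
}

# df and k are handed to the function inside the loop body,
# so each worker sees them regardless of where wss_par was defined
wss_values2 <- foreach(k = k.values, .combine = 'c') %dopar% {
  wss_par(k, df)
}
```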