How to find the clusters that produce the maximum colMeans in R?

Question

I have a data frame like

and I have a vector of the same length as the number of rows in the data frame (it's the cluster from kmeans, if that matters)

[1] 2 2 1...

From those I can get the colMeans for each cluster, like

cm1 <- colMeans(df[fit$cluster==1,])
cm2 <- colMeans(df[fit$cluster==2,])

(I don't think I should do that part explicitly, but that's how I'm thinking about the problem.)

What I want is, for each column of the data frame, the cluster (the value from the vector) whose column mean is the maximum. I'd also like to get (separately is fine) the cluster with the second-highest mean, the third-highest, and so on. So in the example I would want the output to be a vector with one element for each column of the data frame:

1 2 1...

because for the first column of the data frame, the column mean for the first cluster is 3, while the column mean for the second cluster is 0.5.

akrun · Accepted Answer

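If the cluster vector has the same length as the number of rows of 'df', split the data by cluster into a list, compute the column means of each piece, stack the results into one long data frame, and use which.max per column. Note that which.max returns the position of the largest mean within each column's group; that position equals the cluster label when the clusters are numbered 1..k, as they are for kmeans.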

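# column means per cluster, stacked into 'values' (mean) and 'ind' (column name)
lst1 <- lapply(split(df, fit$cluster), function(x) stack(colMeans(x)))
# bind the pieces in cluster order, tagging each with its cluster label
dat <- do.call(rbind, Map(cbind, cluster = names(lst1), lst1))
# for each column, the position (= cluster number) of the largest mean
aggregate(values ~ ind, dat, FUN = which.max)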
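
As an alternative view of the same idea, here is a minimal base R sketch that builds a clusters-by-columns matrix of means and maps which.max back to the cluster labels instead of relying on position. The names `cluster`, `cm` and `best` are illustrative and not part of the original answer, and the sketch assumes at least two clusters (with a single cluster, sapply would return a plain vector rather than a matrix).

```r
# Sample data from the "data" section of this answer
df <- data.frame(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L, 0L, 3L))
cluster <- c(2, 2, 1)  # stands in for fit$cluster

# One row per cluster, one column per variable: the per-cluster column means
cm <- sapply(df, function(col) tapply(col, cluster, mean))

# For each column, look up the label of the row (cluster) with the largest mean
best <- rownames(cm)[apply(cm, 2, which.max)]
names(best) <- colnames(cm)
best
```

For the sample data this gives cluster 1 for V1, 2 for V2 and 1 for V3, which matches the output asked for in the question. The labels come back as character strings because they are taken from the row names.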

If we need more than one cluster per column (for example, the clusters with the highest and second-highest means), create the 'cluster' column in the data and reshape to 'long' format (or use summarise/across). Then group by 'cluster' and 'name', get the mean of 'value', arrange by 'name' and by 'value' in descending order, and return the first n rows for each 'name' with slice_head:

library(dplyr)
library(tidyr)
df %>% 
   mutate(cluster = fit$cluster) %>% 
   pivot_longer(cols = -cluster) %>%
   group_by(cluster, name) %>%
   summarise(value = mean(value), .groups = 'drop') %>% 
   arrange(name, desc(value)) %>% 
   group_by(name) %>%
   slice_head(n = 2)
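
The same ranking can also be sketched in base R, which may be convenient when the k-th highest cluster is wanted as a plain vector. This is only a sketch under the same assumptions as the sample data; `cluster`, `cm` and `ranked` are illustrative names that do not appear in the original answer.

```r
# Sample data from the "data" section of this answer
df <- data.frame(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L, 0L, 3L))
cluster <- c(2, 2, 1)  # stands in for fit$cluster

# Per-cluster column means: one row per cluster, one column per variable
cm <- sapply(df, function(col) tapply(col, cluster, mean))

# Order the cluster labels by decreasing mean within each column;
# row k of 'ranked' holds the cluster with the k-th highest mean
ranked <- apply(cm, 2, function(m) rownames(cm)[order(m, decreasing = TRUE)])

ranked[1, ]  # cluster with the highest mean per column
ranked[2, ]  # cluster with the second-highest mean per column
```

Ties are broken by order(), so clusters with equal means keep their original row order.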

data

df <- structure(list(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L, 
0L, 3L)), class = "data.frame", row.names = c("1", "2", "3"))

fit <- structure(list(cluster = c(2, 2, 1)), class = "data.frame",
  row.names = c(NA, -3L))
