matt

Reputation: 1984

How to find the clusters that produce the maximum colMeans in R?

I have a data frame like

  V1 V2 V3
1  1  1  2
2  0  1  0
3  3  0  3
....

and I have a vector of the same length as the number of rows in the data frame (it's the cluster from kmeans, if that matters)

[1] 2 2 1...

From those I can get the colMeans for each cluster, like

cm1 <- colMeans(df[fit$cluster==1,])
cm2 <- colMeans(df[fit$cluster==2,])

(I don't think I should do that part explicitly, but that's how I'm thinking about the problem.)

What I want, for each column of the data frame, is the cluster label (the value from the vector) whose column mean is the largest. I'd also like to get the second-highest, third-highest, etc. (doing that separately is fine). So in the example I would want the output to be a vector with one element for each column of the data frame:

1 2 1...

because for the first column of the data frame, the column mean for the first cluster is 3, while the column mean for the second cluster is 0.5.

Upvotes: 1

Views: 166

Answers (1)

akrun

Reputation: 887038

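If the cluster vector has the same length as the number of rows of 'df', split the data by cluster into a list, compute the colMeans of each piece, rbind the results into one long data frame, and then use which.max to find the cluster with the largest mean for each column: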

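# per-cluster column means, stacked into long format (columns: values, ind)
lst1 <- lapply(split(df, fit$cluster), function(x) stack(colMeans(x)))
# bind the pieces together, tagging each row with its cluster label
dat <- do.call(rbind, Map(cbind, cluster = names(lst1), lst1))
# position of the largest mean per column; equals the label when clusters are numbered 1..k
aggregate(values ~ ind, dat, FUN = which.max)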
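
Note that which.max returns a position, which matches the cluster label only when the labels are 1, 2, ..., k (as they are for kmeans). As a rough sketch of one way to map the result back to the labels explicitly, assuming the example data from the question (the names `agg`, `means` and `best` are just illustrative):

```r
# Example data from the question; the cluster labels are assumed to come from kmeans
df <- data.frame(V1 = c(1, 0, 3), V2 = c(1, 1, 0), V3 = c(2, 0, 3))
cluster <- c(2, 2, 1)

# One row per cluster, one column per variable, holding the per-cluster column means
agg <- aggregate(df, by = list(cluster = cluster), FUN = mean)
means <- as.matrix(agg[-1])
rownames(means) <- as.character(agg$cluster)

# For each column, look up the label of the row (cluster) with the largest mean
best <- setNames(rownames(means)[apply(means, 2, which.max)], colnames(means))
best
```

Because the labels are looked up through the row names, this should also work when the cluster labels are not consecutive integers.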

If we need more than the top cluster for each column (second-highest, third, and so on), create the 'cluster' column in the data and reshape to 'long' format (or use summarise/across). Then group by 'cluster' and 'name', get the mean of 'value', arrange by 'name' and by 'value' in descending order, group by 'name' again, and return the first n rows of each group with slice_head:

library(dplyr)
library(tidyr)
df %>% 
   mutate(cluster = fit$cluster) %>% 
   pivot_longer(cols = -cluster) %>%
   group_by(cluster, name) %>%
   summarise(value = mean(value), .groups = 'drop') %>% 
   arrange(name, desc(value)) %>% 
   group_by(name) %>%
   slice_head(n = 2)
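
For comparison, a base R sketch of the same ranking idea, again assuming the example data from the question (the names `agg`, `means` and `ranked` are illustrative, not part of any package):

```r
# Example data from the question; the cluster labels are assumed to come from kmeans
df <- data.frame(V1 = c(1, 0, 3), V2 = c(1, 1, 0), V3 = c(2, 0, 3))
cluster <- c(2, 2, 1)

# Per-cluster column means: one row per cluster, one column per variable
agg <- aggregate(df, by = list(cluster = cluster), FUN = mean)
means <- as.matrix(agg[-1])

# Row k of 'ranked' holds, for each column, the label of the cluster
# with the k-th highest mean
ranked <- apply(means, 2, function(v) agg$cluster[order(v, decreasing = TRUE)])

ranked[1, ]  # cluster with the highest mean in each column
ranked[2, ]  # cluster with the second-highest mean in each column
```

Here each row of `ranked` gives one rank as a vector with one element per column, which is the shape the question asks for.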

data

df <- structure(list(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L, 
0L, 3L)), class = "data.frame", row.names = c("1", "2", "3"))

fit <- structure(list(cluster = c(2, 2, 1)), class = "data.frame", 
  row.names = c(NA, 
-3L))

Upvotes: 1
