Reputation: 47
I am trying to find a method to cluster univariate data by group. For example, in the data below I have two failure codes (a and b) and 6 data points for each code. In the plot you can see that each failure code has 2 distinct clusters of failure times. Doing this manually isn't bad for a small example, but I can't figure out how to do it with a larger data set (~100K rows and ~30 codes). I would like the end result to give me the medoid of each cluster and the number of data points in that cluster.
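library(ggplot2)
# Example data: two failure codes with six times-to-failure each
failure <- rep(c("a", "b"), each = 6)
ttf <- c(1, 1.5, 2, 5, 5.5, 6, 8, 8.5, 9, 14, 14.5, 15)
data <- data.frame(failure, ttf)
# Plot time to failure by failure code (qplot() is deprecated in recent ggplot2)
ggplot(data, aes(failure, ttf)) + geom_point()
# Medoids picked by hand for each failure code
results <- data.frame(failure = c("a", "b"), m1 = c(1.5, 8.5), m2 = c(5.5, 14.5))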
I would like for the end result to give me something like the table below.
failure   m1  m1count    m2  m2count
a        1.5        3   5.5        3
b        8.5        3  14.5        3
Upvotes: 0
Views: 953
Reputation: 5966
This will do what you want, assuming only two clusters per failure group. You could change the number of clusters in the tapply call, but the change would apply to all failure groups.
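set.seed(1)  # kmeans uses random starting centers; fix the seed for reproducibility
# Run kmeans with 2 centers on the ttf values of each failure code
res2 <- tapply(data$ttf, INDEX = data$failure, function(x) kmeans(x, 2))
# Collect the cluster centers and sizes for each failure code
res3 <- lapply(names(res2), function(x) data.frame(failure = x, Centers = res2[[x]]$centers, Size = res2[[x]]$size))
res3 <- do.call(rbind, res3)
res3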
   failure Centers Size
1        a     5.5    3
2        a     1.5    3
11       b    14.5    3
21       b     8.5    3
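Note that kmeans returns cluster means, which only happen to coincide with the medoids in this example. If actual medoids and the wide layout from the question are needed, a rough sketch along the following lines might work. It assumes the cluster package (a recommended package normally installed with R) and its pam function, and again assumes a fixed k = 2 per failure code. The helper name medoid_row is made up for illustration.

```r
library(cluster)  # provides pam(); assumed to be installed

failure <- rep(c("a", "b"), each = 6)
ttf <- c(1, 1.5, 2, 5, 5.5, 6, 8, 8.5, 9, 14, 14.5, 15)
data <- data.frame(failure, ttf)

k <- 2  # assumption: same number of clusters for every failure code

# Hypothetical helper: builds one wide row (m1, m1count, m2, m2count, ...)
# for the ttf values x of a single failure code
medoid_row <- function(x, code, k) {
  fit <- pam(matrix(x, ncol = 1), k)
  ord <- order(fit$medoids[, 1])  # order clusters by medoid value
  out <- data.frame(failure = code)
  for (i in seq_len(k)) {
    out[[paste0("m", i)]] <- fit$medoids[ord[i], 1]
    out[[paste0("m", i, "count")]] <- fit$clusinfo[ord[i], "size"]
  }
  out
}

# Split ttf by failure code, cluster each group, and stack the rows
groups <- split(data$ttf, data$failure)
results <- do.call(rbind, lapply(names(groups), function(code) {
  medoid_row(groups[[code]], code, k)
}))
rownames(results) <- NULL
results
```

On the sample data this should give one row per failure code in the layout asked for in the question. pam builds a full dissimilarity matrix per group, so for groups with many thousands of rows it may be worth trying clara from the same package instead.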
Upvotes: 1