Reputation: 2436
Given a list:
foo <- list(c("a", "b", "d"), c("c", "b"), c("c"),
c("b", "d"), c("e", "f"), c("e", "g"))
what is an efficient way to get a list that contains the disjoint sets of its content?
Here I want to obtain:
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
The solutions I have managed to come up with seemed overly complicated and slow (I'm working with a largish list (4000+ elements) that contain up to hundreds of elements).
Thanks!
Solutions benchmarking
Thank you all for your input. The igraph approach is really nice. I did some benchmarking of the proposed solutions and using igraph with @flodel suggestion is efficient. The example here (iGrp
) has 3170 elements.
> microbenchmark(igraph_method(iGrp), igraph_method2(iGrp), iterative_method(iGrp), times=10L)
## Unit: milliseconds
## expr min lq median uq max neval
## igraph_method(iGrp) 6892.8534 7140.0287 7229.5569 7396.2458 8044.9796 10
## igraph_method2(iGrp) 381.4555 391.2097 442.3282 472.5641 537.4885 10
## iterative_method(iGrp) 7118.7857 7272.9568 7595.9700 7675.2888 8485.4388 10
#### functions used
igraph_method <- function(lst) {
edg <- do.call("rbind", lapply(lst, function(x) {
if (length(x) > 1) t(combn(x, 2)) else NULL
}))
g <- graph.data.frame(edg)
split(V(g)$name, clusters(g)$membership)
}
igraph_method2 <- function(lst) {
edg <- do.call("rbind", lapply(lst, function(x) {
if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
}))
g <- graph.data.frame(edg)
split(V(g)$name, clusters(g)$membership)
}
iterative_method <- function(lst) {
Reduce(function(l, x) {
matches <- sapply(l, function(i) any(x %in% i))
if (any(matches)) {
combined <- unique(c(unlist(l[matches]), x))
l[matches] <- NULL # Delete old entries
l <- c(l, list(combined)) # Add combined entries
} else {
l <- c(l, list(x)) # New list entry
}
l
}, lst, init=list())
}
Upvotes: 4
Views: 2386
Reputation: 44309
One way to approach this sort of problem is to build a graph where nodes are the values in your list and edges are whether those values have appeared together. Then you're just asking for the connected components of that graph. The igraph
package in R makes this pretty easy. First, you'll want to build a data frame with the edges:
edges <- do.call(rbind, lapply(foo, function(x) {
if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
}))
edges
# [,1] [,2]
# [1,] "a" "b"
# [2,] "b" "d"
# [3,] "c" "b"
# [4,] "b" "d"
# [5,] "e" "f"
# [6,] "e" "g"
Then, you can build your graph from the edges and compute the connected components:
library(igraph)
g <- graph.data.frame(edges, directed=FALSE)
split(V(g)$name, clusters(g)$membership)
# $`1`
# [1] "a" "b" "c" "d"
#
# $`2`
# [1] "e" "f" "g"
For reasonably large problems, this approach seems to be modestly faster than an iterative approach:
values = as.character(1:2000)
set.seed(144)
foo <- lapply(1:4000, function(x) sample(values, rbinom(1, 10, .5)))
library(microbenchmark)
microbenchmark(josilber(foo), lundberg(foo))
# Unit: milliseconds
# expr min lq median uq max neval
# josilber(foo) 251.8007 281.0168 297.2446 314.6714 635.7916 100
# lundberg(foo) 640.0575 714.9658 761.3777 827.5415 1118.3517 100
Upvotes: 7
Reputation: 42639
Here is an iterative approach, building a list for the result, and combining elements as they are seen together:
Reduce(function(l, x) {
matches <- sapply(l, function(i) any(x %in% i))
if (any(matches)) {
combined <- unique(c(unlist(l[matches]), x))
l[matches] <- NULL # Delete old entries
l <- c(l, list(combined)) # Add combined entries
} else {
l <- c(l, list(x)) # New list entry
}
l
}, foo, init=list())
## [[1]]
## [1] "a" "b" "d" "c"
##
## [[2]]
## [1] "e" "f" "g"
Upvotes: 2