fmic_
fmic_

Reputation: 2436

Get disjoint sets from a list in R

Given a list:

foo <- list(c("a", "b", "d"), c("c", "b"), c("c"),
            c("b", "d"), c("e", "f"), c("e", "g"))

what is an efficient way to get a list that contains the disjoint sets of its content?

Here I want to obtain:

[[1]]
[1] "a" "b" "c" "d"

[[2]]
[1] "e" "f" "g"

The solutions I have managed to come up with seemed overly complicated and slow (I'm working with a largish list (4000+ elements) that contain up to hundreds of elements).

Thanks!


Solutions benchmarking

Thank you all for your input. The igraph approach is really nice. I did some benchmarking of the proposed solutions and using igraph with @flodel suggestion is efficient. The example here (iGrp) has 3170 elements.

> microbenchmark(igraph_method(iGrp), igraph_method2(iGrp), iterative_method(iGrp), times=10L)
## Unit: milliseconds
##                    expr       min        lq    median        uq       max neval
##     igraph_method(iGrp) 6892.8534 7140.0287 7229.5569 7396.2458 8044.9796    10
##    igraph_method2(iGrp)  381.4555  391.2097  442.3282  472.5641  537.4885    10
##  iterative_method(iGrp) 7118.7857 7272.9568 7595.9700 7675.2888 8485.4388    10

#### functions used

igraph_method <- function(lst) {
    edg <- do.call("rbind", lapply(lst, function(x) {
        if (length(x) > 1) t(combn(x, 2)) else NULL
        }))
    g <- graph.data.frame(edg)
    split(V(g)$name, clusters(g)$membership)
}

igraph_method2 <- function(lst) {
    edg <- do.call("rbind", lapply(lst, function(x) {
        if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
    }))
    g <- graph.data.frame(edg)
    split(V(g)$name, clusters(g)$membership)
}

iterative_method <- function(lst) {
    Reduce(function(l, x)  {
        matches <- sapply(l, function(i) any(x %in% i))

        if (any(matches)) {
            combined <- unique(c(unlist(l[matches]), x))
            l[matches] <- NULL        # Delete old entries
            l <- c(l, list(combined)) # Add combined entries
        } else {
            l <- c(l, list(x))        # New list entry
        }
        l
    }, lst, init=list())
}

Upvotes: 4

Views: 2386

Answers (2)

josliber
josliber

Reputation: 44309

One way to approach this sort of problem is to build a graph where nodes are the values in your list and edges are whether those values have appeared together. Then you're just asking for the connected components of that graph. The igraph package in R makes this pretty easy. First, you'll want to build a data frame with the edges:

edges <- do.call(rbind, lapply(foo, function(x) {
  if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL  
}))
edges
#      [,1] [,2]
# [1,] "a"  "b" 
# [2,] "b"  "d" 
# [3,] "c"  "b" 
# [4,] "b"  "d" 
# [5,] "e"  "f" 
# [6,] "e"  "g" 

Then, you can build your graph from the edges and compute the connected components:

library(igraph)
g <- graph.data.frame(edges, directed=FALSE)
split(V(g)$name, clusters(g)$membership)
# $`1`
# [1] "a" "b" "c" "d"
# 
# $`2`
# [1] "e" "f" "g"

For reasonably large problems, this approach seems to be modestly faster than an iterative approach:

values = as.character(1:2000)
set.seed(144)
foo <- lapply(1:4000, function(x) sample(values, rbinom(1, 10, .5)))
library(microbenchmark)
microbenchmark(josilber(foo), lundberg(foo))
# Unit: milliseconds
#           expr      min       lq   median       uq       max neval
#  josilber(foo) 251.8007 281.0168 297.2446 314.6714  635.7916   100
#  lundberg(foo) 640.0575 714.9658 761.3777 827.5415 1118.3517   100

Upvotes: 7

Matthew Lundberg
Matthew Lundberg

Reputation: 42639

Here is an iterative approach, building a list for the result, and combining elements as they are seen together:

Reduce(function(l, x) {
    matches <- sapply(l, function(i) any(x %in% i))

    if (any(matches)) {
        combined <- unique(c(unlist(l[matches]), x))
        l[matches] <- NULL        # Delete old entries
        l <- c(l, list(combined)) # Add combined entries
    } else {
        l <- c(l, list(x))        # New list entry
    }
    l
}, foo, init=list())
## [[1]]
## [1] "a" "b" "d" "c"
## 
## [[2]]
## [1] "e" "f" "g"

Upvotes: 2

Related Questions