Reputation: 41
My problem: need to find all disjoint (non-overlapping) sets from a set of sets.
Background: I am using comparative phylogenetic methods to study trait evolution in birds. I have a tree with ~300 species. This tree can be divided into subclades (i.e. subtrees). If two subclades do not share species, they are independent. I'm looking for an algorithm (and an R implementation if possible) that will find all possible subclade partitions where each subclade has greater than 10 taxa and all are independent. Each subclade can be considered a set and when two subclades are independent (do not share species) these subclades are then disjoint sets.
Hope this is clear and someone can help.
Cheers, Glenn
The following code produces an example dataset. Where subclades is a list of all possible subclades (sets) from which I'd like to sample X disjoint sets, where the length of the set is Y.
###################################
# Example Dataset
###################################
library(ape)
library(phangorn)
library(TreeSim)
library(phytools)
##simulate a tree
n.taxa <- 300
tree <- sim.bd.taxa(n.taxa,1,lambda=.5,mu=0)[[1]][[1]]
tree$tip.label <- seq(n.taxa)
##extract all monophyletic subclades
get.all.subclades <- function(tree){
tmp <- vector("list")
nodes <- sort(unique(tree$edge[,1]))
i <- 282
for(i in 1:length(nodes)){
x <- Descendants(tree,nodes[i],type="tips")[[1]]
tmp[[i]] <- tree$tip.label[x]
}
tmp
}
tmp <- get.all.subclades(tree)
##set bounds on the maximum and mininum number of tips of the subclades to include
min.subclade.n.tip <- 10
max.subclade.n.tip <- 40
##function to replace trees of tip length exceeding max and min with NA
replace.trees <- function(x, min, max){
if(length(x) >= min & length(x)<= max) x else NA
}
#apply testNtip across all the subclades
tmp2 <- lapply(tmp, replace.trees, min = min.subclade.n.tip, max = max.subclade.n.tip)
##remove elements from list with NA,
##all remaining elements are subclades with number of tips between
##min.subclade.n.tip and max.subclade.n.tip
subclades <- tmp2[!is.na(tmp2)]
names(subclades) <- seq(length(subclades))
Upvotes: 4
Views: 5055
Reputation: 12819
Here are some functions that might be useful.
The first computes all possible disjoint collections of a list of sets.
I'm using "collection" instead of "partition" beacause a collection does not necessarily covers the universe (i. e., the union of all sets).
The algorithm is recursive, and only works for a small number of possible collections. This does not necessarily means that it won't work with a large list of sets, since the function removes the intersecting sets at every iteration.
If the code is not clear, please ask and I'll add comments.
The input must be a named list, and the result will be a list of collection, which is a character vector indicating the names of the sets.
DisjointCollectionsNotContainingX <- function(L, branch=character(0), x=numeric(0))
{
filter <- vapply(L, function(y) length(intersect(x, y))==0, logical(1))
L <- L[filter]
result <- list(branch)
for( i in seq_along(L) )
{
result <- c(result, Recall(L=L[-(1:i)], branch=c(branch, names(L)[i]), x=union(x, L[[i]])))
}
result
}
This is just a wrapper to hide auxiliary arguments:
DisjointCollections <- function(L) DisjointCollectionsNotContainingX(L=L)
The next function can be used to validade a given list of collections supposedly non-overlapping and "maximal".
For every collection, it will test if
1. all sets are effectively disjoint and
2. adding another set either results in a non-disjoint collection or an existing collection:
ValidateDC <- function(L, DC)
{
for( collection in DC )
{
for( i in seq_along(collection) )
{
others <- Reduce(f=union, x=L[collection[-i]])
if( length(intersect(L[collection[i]], others)) > 0 ) return(FALSE)
}
elements <- Reduce(f=union, x=L[collection])
for( k in seq_along(L) ) if( ! (names(L)[k] %in% collection) )
{
if( length(intersect(elements, L[[k]])) == 0 )
{
check <- vapply(DC, function(z) setequal(c(collection, names(L)[k]), z), logical(1))
if( ! any(check) ) return(FALSE)
}
}
}
TRUE
}
Example:
L <- list(A=c(1,2,3), B=c(3,4), C=c(5,6), D=c(6,7,8))
> ValidateDC(L,DisjointCollections(L))
[1] TRUE
> DisjointCollections(L)
[[1]]
character(0)
[[2]]
[1] "A"
[[3]]
[1] "A" "C"
[[4]]
[1] "A" "D"
[[5]]
[1] "B"
[[6]]
[1] "B" "C"
[[7]]
[1] "B" "D"
[[8]]
[1] "C"
[[9]]
[1] "D"
Note that the collections containing A
and B
simultaneously do not show up, due to their non-null intersection. Also, collections with C
and D
simultaneously don't appear. Others are OK.
Note: the empty collection character(0)
is always a valid combination.
After creating all possible disjoint collections, you can apply any filters you want to proceed.
EDIT:
I've removed the line if( length(L)==0 ) return(list(branch))
from the first function; it's not needed.
Performance: If there is considerable overlapping among sets, the function runs fast. Example:
set.seed(1)
L <- lapply(1:50, function(.)sample(x=1200, size=20))
names(L) <- c(LETTERS, letters)[1:50]
system.time(DC <- DisjointCollections(L))
Result:
# user system elapsed
# 9.91 0.00 9.92
Total number of collections found:
> length(DC)
[1] 121791
Upvotes: 1
Reputation: 162321
Here's an example of how you might test each pair of list elements for zero overlap, extracting the indices of all non-overlapping pairs.
findDisjointPairs <- function(X) {
## Form a 2-column matrix enumerating all pairwise combos of X's elements
ij <- t(combn(length(X),2))
## A function that tests for zero overlap between a pair of vectors
areDisjoint <- function(i, j) length(intersect(X[[i]], X[[j]])) == 0
## Use mapply to test for overlap between each pair and extract indices
## of pairs with no matches
ij[mapply(areDisjoint, ij[,1], ij[,2]),]
}
## Make some reproducible data and test the function on it
set.seed(1)
A <- replicate(sample(letters, 5), n=5, simplify=FALSE)
findDisjointPairs(A)
# [,1] [,2]
# [1,] 1 2
# [2,] 1 4
# [3,] 1 5
Upvotes: 2