Ingo
Ingo

Reputation: 53

combn-like task for combinations as elements in R

not sure if I chose a good title at all... Also I don't know if I use the right terminology, so maybe with the right search terms I would find a solution to this problem...

I have a list of strings from which I want to have all sets of "exclusive" combinations of 3.

Example: With the following

require(utils)
mylist<-c("strA","strB","strC","strD","strE","strF")
t(combn(mylist,3))

I get a table that lists all possible combinations of 3 out of those 6 strings (so each row represents one combination of 3):

        [,1]   [,2]   [,3]  
   [1,] "strA" "strB" "strC"
   [2,] "strA" "strB" "strD"
   [3,] "strA" "strB" "strE"
   [4,] "strA" "strB" "strF"
   [5,] "strA" "strC" "strD"
   [6,] "strA" "strC" "strE"
   [7,] "strA" "strC" "strF"
   [8,] "strA" "strD" "strE"
   [9,] "strA" "strD" "strF"
  [10,] "strA" "strE" "strF"
  [11,] "strB" "strC" "strD"
  [12,] "strB" "strC" "strE"
  [13,] "strB" "strC" "strF"
  [14,] "strB" "strD" "strE"
  [15,] "strB" "strD" "strF"
  [16,] "strB" "strE" "strF"
  [17,] "strC" "strD" "strE"
  [18,] "strC" "strD" "strF"
  [19,] "strC" "strE" "strF"
  [20,] "strD" "strE" "strF"

but I would like to have all sets of combinations of 3, in which each string only appears once. So my desired output would look sth like this:

$1
        [,1]   [,2]   [,3]  
   [1,] "strA" "strB" "strC"
   [2,] "strD" "srtE" "strF"
$2
        [,1]   [,2]   [,3]  
   [1,] "strA" "strB" "strD"
   [1,] "strC" "strE" "strF"
$3
        [,1]   [,2]   [,3]  
   [1,] "strA" "strB" "strE"
   [1,] "strC" "strD" "strF"
...

So here each subelement ($1, $2, $3, etc.) contains 2 combinations of 3 strings (as 2*3=6; with 6 strings). In each set each string must not appear more than once.

Of course it would be great if this would be also possible for lengths of mylist that are not a multiple of n=3. In case we have let's say 10 strings (with the addition of "strG", "strH", "strI", and "strJ"), I would like to have one string ommitted in each combination. So the desired result would be sth like

$1
        [,1]   [,2]   [,3]  
   [1,] "strA" "strB" "strC"
   [2,] "strD" "strE" "strF"
   [3,] "strG" "strH" "strI"
$2
        [,1]   [,2]   [,3]  
   [1,] "strA" "strB" "strC"
   [2,] "strD" "strE" "strF"
   [3,] "strG" "strH" "strJ"
$3
        [,1]   [,2]   [,3]  
   [1,] "strA" "strB" "strC"
   [2,] "strD" "strE" "strF"
   [3,] "strG" "strI" "strJ"
$4
        [,1]   [,2]   [,3]  
   [1,] "strA" "strB" "strC"
   [2,] "strE" "strF" "strG"
   [3,] "strH" "strI" "strJ"
...

Does someone have a solution to this? If my explanation was unclear, please let me know.

Cheers

Upvotes: 2

Views: 132

Answers (2)

Ingo
Ingo

Reputation: 53

Based on 42's help (thanks again!) I figured out a way which is anything but elegant, but does the job (slowly...). But only because I could eliminate some of the possible combinations before doing the following steps it was feasible this way. In my original problem I have 49 strings, which would result in extremely big vectors, so be careful when applying the following steps to more than, say, 15 strings. There is surely a way to calculate how many combinations have to be processed...

Here comes the complete example

require(utils)
mylist<-paste("str",LETTERS[1:10],sep="")
mat<-as.data.frame(t(combn(mylist, 3, simplify = TRUE)))
mat[] <- lapply(mat, as.character)

mat.subset<-list()
for (i in seq(nrow(mat)))
{
  mat.temp<-mat
  j=1
  mat.subset[[i]]<-mat[i,]
  rem.row<-sort(unique(c(which(mat.temp[,1]%in%mat[i,1:3]),which(mat.temp[,2]%in%mat[i,1:3]),which(mat.temp[,3]%in%mat[i,1:3]))))
  mat.temp<-mat.temp[-rem.row,]
  while (j<=nrow(mat.temp))
  {
    if(!length(intersect(mat.temp[j,1:3],unlist(mat.subset[[i]]))))
    {
      mat.subset[[i]]<-rbind(mat.subset[[i]],mat.temp[j,])
      rem.row<-sort(unique(c(which(mat.temp[,1]%in%mat.temp[j,1:3]),which(mat.temp[,2]%in%mat[i,1:3]),which(mat.temp[,3]%in%mat[i,1:3]))))
      mat.temp<-mat.temp[-rem.row,]
    }
    j<-j+1
  }
}
mat.subset.lengths<-unlist(lapply(mat.subset,function(x) nrow(x)))
mat.subset<-mat.subset[which(mat.subset.lengths==max(mat.subset.lengths))]

The last two steps were necessary in my case as I had, as stated above, excluded some combinations before the time consuming for loop, and only a certain number of starting points would yield a complete solution (or in the worst case close to complete solution).

If you have a hint that there are more sets than covered with this procedure or if you have a more elgant way, I would appreciate your input.

Upvotes: 1

IRTFM
IRTFM

Reputation: 263342

Will assume the transposed combo matrix is named mat. Check to see if there is any overlap with length applied to results of intersect function:

 res <- list();
 for (i in 1:nrow(mat) ){
    for( j in 1:nrow(mat)){  
          if( !length(intersect(mat[i,] , mat[j,])) ) 
               res[[paste(i,j,sep="_")]] <- rbind( mat[i,], mat[j, ]) } }


> res
$`1_20`
     [,1]   [,2]   [,3]  
[1,] "strA" "strB" "strC"
[2,] "strD" "strE" "strF"

$`2_19`
     [,1]   [,2]   [,3]  
[1,] "strA" "strB" "strD"
[2,] "strC" "strE" "strF"

$`3_18`
     [,1]   [,2]   [,3]  
[1,] "strA" "strB" "strE"
[2,] "strC" "strD" "strF"

.... snipped

Depending on your definition of "unique" you might decide to only take the first ten items since half of these are transpositions of rows:

> res[[1]]
     [,1]   [,2]   [,3]  
[1,] "strA" "strB" "strC"
[2,] "strD" "strE" "strF"
> res[[20]]
     [,1]   [,2]   [,3]  
[1,] "strD" "strE" "strF"
[2,] "strA" "strB" "strC"

Upvotes: 1

Related Questions