Reputation: 99

remove vectors which are subsets of other vectors in a list

I have a list with vectors of different length, for example:

a=c(12345,12367,91670,87276,92865)
b=c(12345,87276,89250)
c=c(12367,91670)
d=c(23753,82575,91475,10957,92865,24311)
mylist=list(a,b,c,d)
mylist

# [[1]]
# [1] 12345 12367 91670 87276 92865
# 
# [[2]]
# [1] 12345 87276 89250
# 
# [[3]]
# [1] 12367 91670
# 
# [[4]]
# [1] 23753 82575 91475 10957 92865 24311

my question is how can I remove the vectors of this list which are a subset of another vector of the same list. i.e. in this case how can I remove the 3rd object of the list which is a subset of the 1st object??

Upvotes: 6

Answers (5)

alexis_laz

Reputation: 13122

An alternative (of the "list") format for the elements is a table of "elements-by-values":

table(values = unlist(mylist), 
      elt = rep(seq_along(mylist), lengths(mylist)))
#       elt
#values  1 2 3 4
#  10957 0 0 0 1
#  12345 1 1 0 0
#  12367 1 0 1 0
#  23753 0 0 0 1
#  24311 0 0 0 1
#  82575 0 0 0 1
#  87276 1 1 0 0
#  89250 0 1 0 0
#  91475 0 0 0 1
#  91670 1 0 1 0
#  92865 1 0 0 1

This can, very easy, consume much memory so we could chase a sparse alternative:

l = unlist(mylist)
ul = unique(l)
tab = sparseMatrix(x = TRUE, 
                   i = match(l, ul), 
                   j = rep(seq_along(mylist), lengths(mylist)), 
                   dimnames = list(ul, sprintf("elt_%d", seq_along(mylist))))

tab
#11 x 4 sparse Matrix of class "lgCMatrix"
#      elt_1 elt_2 elt_3 elt_4
#12345     |     |     .     .
#12367     |     .     |     .
#91670     |     .     |     .
#87276     |     |     .     .
#92865     |     .     .     |
#89250     .     |     .     .
#23753     .     .     .     |
#82575     .     .     .     |
#91475     .     .     .     |
#10957     .     .     .     |
#24311     .     .     .     |

Then, to find which element is a subset of which:

subsets = lengths(mylist) == crossprod(tab)

subsets
#4 x 4 sparse Matrix of class "lgCMatrix"
#      elt_1 elt_2 elt_3 elt_4
#elt_1     |     .     .     .
#elt_2     .     |     .     .
#elt_3     |     .     |     .
#elt_4     .     .     .     |

where, here, each element is a subset of ...itself and 3 is a subset of 1. To reach the wanted info we could use:

subset(summary(subsets), i != j)[c("i", "j")]

# 
#  i j
#2 3 1

Or, to avoid incorporating "diag" indices and subsetting them afterwards, we could manipulate the structure already present:

dp = diff(subsets@p)
j = rep(0:(length(dp) - 1), dp)
wh = subsets@i != j
cbind(subset = subsets@i[wh], of = j[wh]) + 1L
#     subset of
#[1,]      3  1

In both the last cases, the unique of the first column shows which elements are subset of another and can be used to [ "mylist".

Upvotes: 1

Andrew Gustar

Reputation: 18435

This gives a new list without the elements that are subsets of others...

newlist <- mylist[!sapply(seq_along(mylist), function(i) max(sapply(mylist[-i],function(L) all(mylist[[i]] %in% L))))]

Upvotes: 3

lmo

Reputation: 38520

Here is another method. It is also not super efficient, but will return the position of nested lists.

# get ordered pairwise combinations of list positions
combos <- combn(1:4, 2)
combos <- cbind(combos, combos[2:1,])

Ordered is important, because the comparison is not symmetric. Now, loop through these combinations and compare sets using intersect.

combos[1, sapply(seq_len(ncol(combos)),
                function(i) setequal(intersect(mylist[[combos[1,i]]], mylist[[combos[2,i]]]),
                                     mylist[[combos[1,i]]]))]
[1] 3

rewriting the last line to use mapply rather than sapply may improve readability.

combos[1, mapply(function(x, y) setequal(intersect(mylist[[x]], mylist[[y]]), mylist[[x]]),
                 combos[1,], combos[2,])]
[1] 3

Upvotes: 1

d.b

Reputation: 32558

which(t(sapply(seq_along(mylist), function(i)
    sapply(mylist[-i], function(a)
        all(unlist(mylist[i]) %in% a)))), arr.ind = TRUE)
#     row col
#[1,]   3   1
#Suggests that 3rd item is contained within 1st item

Upvotes: 2

MrFlick

Reputation: 206556

This is may be pretty inefficient, but if your list isn't that big it could work

find_nested <- function(mylist) {
    mm <- sapply(mylist, function(x) sapply(mylist, function(y) all(x %in%y)))
    diag(mm) <- FALSE
    apply(mm,2,any)
}

This can tell you which vectors are subsets of other vectors. It does this by comparing every vector to every other vector.

find_nested(mylist)
# [1] FALSE FALSE  TRUE FALSE

So we can see the third item is contained in another list.

Upvotes: 3

remove vectors which are subsets of other vectors in a list

Answers (5)

Related Questions