Jan Šimbera

Reputation: 466

Count number of occurrences of vector in list

I have a list of vectors of variable length, for example:

q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))

I need to count the number of occurrences of each of the vectors in the list, for example (any other suitable data structure is acceptable):

list(list(c(1,3,5), 2), list(c(2,4), 1), list(c(2,5), 2), list(c(7), 1))

Is there an efficient way to do this? The actual list has tens of thousands of items, so quadratic behaviour is not feasible.

Upvotes: 11

Views: 4580

Answers (2)

alexis_laz

Reputation: 13122

match and unique accept and handle "list"s too (?match warns that it can be slow on "list"s). So, with:

match(q, unique(q))
#[1] 1 2 1 3 4 3

each element is mapped to a single integer. Then:

tabulate(match(q, unique(q)))
#[1] 2 1 2 1
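For reference, unique(q) holds the four distinct vectors that these counts refer to (output sketched here for illustration):

unique(q)
#[[1]]
#[1] 1 3 5
#
#[[2]]
#[1] 2 4
#
#[[3]]
#[1] 2 5
#
#[[4]]
#[1] 7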

Finally, find a structure to present the results:

as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
#      vec n
#1 1, 3, 5 2
#2    2, 4 1
#3    2, 5 2
#4       7 1

As an alternative to the match(x, unique(x)) approach, we could map each element to a single value by deparse-ing it:

table(sapply(q, deparse))
#
#         7 c(1, 3, 5)    c(2, 4)    c(2, 5) 
#         1          2          1          2

Also, since each vector here contains unique integers, and assuming they fall in a small range, we could map each element to a single integer by transforming it to a binary representation:

n = max(unlist(q))
pow2 = 2 ^ (0:(n - 1))
sapply(q, function(x) tabulate(x, nbins = n))  # 'binary' form
sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
#[1] 21 10 21 18 64 18

and then tabulate as before.
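For example, a quick sketch of that last step (storing the mapped values in v first; the name is just for illustration):

v = sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))  # integer encoding from above
tabulate(match(v, unique(v)))
#[1] 2 1 2 1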

And just to compare the above alternatives:

f1 = function(x) 
{
    ux = unique(x)
    i = match(x, ux)
    cbind(vec = ux, n = tabulate(i))
}   

f2 = function(x)
{
    xc = sapply(x, deparse)
    i = match(xc, unique(xc))
    cbind(vec = x[!duplicated(i)], n = tabulate(i))
}  

f3 = function(x)
{
    n = max(unlist(x))
    pow2 = 2 ^ (0:(n - 1))
    v = sapply(x, function(X) sum(tabulate(X, nbins = n) * pow2))
    i = match(v, unique(v))
    cbind(vec = x[!duplicated(v)], n = tabulate(i))
}

q2 = rep_len(q, 1e3)

all.equal(f1(q2), f2(q2))
#[1] TRUE
all.equal(f2(q2), f3(q2))
#[1] TRUE

microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2))
#Unit: milliseconds
#   expr       min        lq      mean    median        uq       max neval cld
# f1(q2)  7.980041  8.161524 10.525946  8.291678  8.848133 178.96333   100  b 
# f2(q2) 24.407143 24.964991 27.311056 25.514834 27.538643  45.25388   100   c
# f3(q2)  3.951567  4.127482  4.688778  4.261985  4.518463  10.25980   100 a 

Another interesting alternative is based on ordering. R (>= 3.3.0) has a grouping function, built on code from data.table, which, along with the ordering, provides some attributes for further manipulation:

Make all elements of equal length and "transpose" them (probably the slowest operation in this case, though I'm not sure how else to feed grouping):

n = max(lengths(q))
qq = .mapply(c, lapply(q, "[", seq_len(n)), NULL)
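To see what this produces, qq should hold one vector per position, padded with NA (output sketched here for illustration):

qq
#[[1]]
#[1] 1 2 1 2 7 2
#
#[[2]]
#[1]  3  4  3  5 NA  5
#
#[[3]]
#[1]  5 NA  5 NA NA NA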

Use the ordering to group identical elements, mapping each to an integer:

gr = do.call(grouping, qq)
e = attr(gr, "ends") 
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
i
#[1] 1 2 1 3 4 3

then, tabulate as before. To continue the comparisons:

f4 = function(x)
{
    n = max(lengths(x))
    x2 = .mapply(c, lapply(x, "[", seq_len(n)), NULL)
    gr = do.call(grouping, x2)
    e = attr(gr, "ends") 
    i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
    cbind(vec = x[!duplicated(i)], n = tabulate(i))
}

all.equal(f3(q2), f4(q2))
#[1] TRUE

microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2), f4(q2))
#Unit: milliseconds
#   expr       min        lq      mean    median        uq        max neval cld
# f1(q2)  7.956377  8.048250  8.792181  8.131771  8.270101  21.944331   100  b 
# f2(q2) 24.228966 24.618728 28.043548 25.031807 26.188219 195.456203   100   c
# f3(q2)  3.963746  4.103295  4.801138  4.179508  4.360991  35.105431   100 a  
# f4(q2)  2.874151  2.985512  3.219568  3.066248  3.186657   7.763236   100 a

In this comparison, q's elements are of small length to accommodate f3, but f3 (because of the large exponentiation) and f4 (because of mapply) will suffer in performance if "list"s with larger elements are used.

Upvotes: 18

Sotos

Reputation: 51582

One way is to paste each vector, unlist and tabulate, i.e.

table(unlist(lapply(q, paste, collapse = ',')))

#1,3,5   2,4   2,5     7 
#    2     1     2     1 
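If a tabular result is needed, the same counts can be reshaped into a data frame, e.g. (a small sketch; cnt is a name chosen for illustration, and the pasted keys stay as strings):

cnt <- table(unlist(lapply(q, paste, collapse = ',')))
data.frame(vec = names(cnt), n = as.integer(cnt))
#    vec n
#1 1,3,5 2
#2   2,4 1
#3   2,5 2
#4     7 1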

Upvotes: 7
