R. Zhu
R. Zhu

Reputation: 415

store a vector in each cell by group

Hi everyone I'm searching for a way to store a vector in each cell, here's the sample dataset.

dt1 <- data.table(id = rep(1:2, each = 3), set.a = c(5,1,3,10,4,7))
dt1
   id set.a
1:  1     5
2:  1     1
3:  1     3
4:  2    10
5:  2     4
6:  2     7

Now what I want is a method to convert dt1 to something like this:

   id    set.a
1:  1    5,1,3
2:  2 10, 4, 7

This question may be very elementary but it did consume me nearly an hour. In fact it's part of another job. Suppose I have another dataset looks like:

dt2 <- data.table(id = rep(1:2, each = 3), set.b = c(3,5,9,8,10,4))
dt
   id set.b
1:  1     3
2:  1     5
3:  1     9
4:  2     8
5:  2    10
6:  2     4

What I'm truly looking for is how to compute the overlap of each id, i.e, a overlap variable to indicate the intersetion of set.a and set.b by each id looks like

   id overlap
1:  1     5,3
2:  2   10, 4

To implement this, I plan to first aggregate the set variable by each id into a vector and then calculate their intersection, but I failed to store a vector in each cell, just as what I mentioned at the beginning. Is there anyone can help me out? Thanks in advance.

Update

I experimented both akrun's and Frank's way and found the data.table merge way is much more efficient. Here is a little benchmark, thanks for both of you :)

dt1 <- data.table(id = rep(1:10000, each = 10), set1 = sample(letters[1:24], 100000, replace = T))
dt2 <- data.table(id = rep(1:10000, each = 10), set2 = sample(letters[1:24], 100000, replace = T))

system.time({
re1 <- rbindlist(list(dt1, dt2), idcol=TRUE)[,
  .(overlap=toString(intersect(set1[.id==1], set1[.id==2]))) , by =id]
}) 
# 0.25s

system.time({
re2 <- dt1[dt2, on = c(id = "id", set1 = "set2"), nomatch = 0][, .(ovlp = list(unique(set1))), by = "id"]
})  
# 0.07s

system.time({
dt3 <- dt1[, .(set1 = list(set1)), by = id]
dt4 <- dt2[, .(set2 = list(set2)), by = id]
re3 <- dt3[dt4, nomatch = 0, on = "id"][, .(ov = list(intersect(unlist(set1), unlist(set2)))), by = id]
})  
# 0.21s

Upvotes: 2

Views: 81

Answers (2)

Frank
Frank

Reputation: 66819

I would use a merge:

res <- merge(
  dt1[, .(a = list(set.a)), by=id], # <- answer to the first question
  dt2[, .(b = list(set.b)), by=id], 
  by="id"
)[,
  overlap := .(Map(intersect, a, b))
]

#    id        a        b overlap
# 1:  1    5,1,3    3,5,9     5,3
# 2:  2 10, 4, 7  8,10, 4   10, 4

If speed is a concern, I suspect this is faster:

dt1[dt2, on = c(id = "id", set.a = "set.b"), nomatch=0][,
  .(overlap = list(set.a))
, by=id]

#    id overlap
# 1:  1     3,5
# 2:  2   10, 4

dt1[dt2,...] is just another kind of merge/join.

Upvotes: 1

akrun
akrun

Reputation: 886978

We can use toString to paste the elements in 'set.a', grouped by the 'id' variable.

dt1[, .(set.a=toString(set.a)), by = id]
#   id    set.a
#1:  1  5, 1, 3
#2:  2 10, 4, 7

For the second case, we rbind both the datasets to a single one (rbindlist(list(...)) with idcol=TRUE. Get the intersect of 'set.a' for .id, grouped by the 'id' column and paste the elements together using toString.

 rbindlist(list(dt1, dt2), idcol=TRUE)[,
   .(overlap=toString(intersect(set.a[.id==1], set.a[.id==2]))) , by =id]
#   id overlap
#1:  1    5, 3
#2:  2   10, 4

Or as @Richard Scriven mentioned in the comments, we paste the elements that are duplicated in the 'set.a' by the 'id' column after we rbind both the datasets.

rbindlist(list(dt1, dt2))[,
    .(set.a = toString(set.a[duplicated(set.a)])), by = id]

Upvotes: 3

Related Questions