Reputation: 415
Hi everyone I'm searching for a way to store a vector in each cell, here's the sample dataset.
dt1 <- data.table(id = rep(1:2, each = 3), set.a = c(5,1,3,10,4,7))
dt1
id set.a
1: 1 5
2: 1 1
3: 1 3
4: 2 10
5: 2 4
6: 2 7
Now what I want is a method to convert dt1
to something like this:
id set.a
1: 1 5,1,3
2: 2 10, 4, 7
This question may be very elementary but it did consume me nearly an hour. In fact it's part of another job. Suppose I have another dataset looks like:
dt2 <- data.table(id = rep(1:2, each = 3), set.b = c(3,5,9,8,10,4))
dt
id set.b
1: 1 3
2: 1 5
3: 1 9
4: 2 8
5: 2 10
6: 2 4
What I'm truly looking for is how to compute the overlap of each id, i.e, a overlap
variable to indicate the intersetion of set.a
and set.b
by each id looks like
id overlap
1: 1 5,3
2: 2 10, 4
To implement this, I plan to first aggregate the set
variable by each id into a vector and then calculate their intersection, but I failed to store a vector in each cell, just as what I mentioned at the beginning. Is there anyone can help me out? Thanks in advance.
I experimented both akrun's and Frank's way and found the data.table merge way is much more efficient. Here is a little benchmark, thanks for both of you :)
dt1 <- data.table(id = rep(1:10000, each = 10), set1 = sample(letters[1:24], 100000, replace = T))
dt2 <- data.table(id = rep(1:10000, each = 10), set2 = sample(letters[1:24], 100000, replace = T))
system.time({
re1 <- rbindlist(list(dt1, dt2), idcol=TRUE)[,
.(overlap=toString(intersect(set1[.id==1], set1[.id==2]))) , by =id]
})
# 0.25s
system.time({
re2 <- dt1[dt2, on = c(id = "id", set1 = "set2"), nomatch = 0][, .(ovlp = list(unique(set1))), by = "id"]
})
# 0.07s
system.time({
dt3 <- dt1[, .(set1 = list(set1)), by = id]
dt4 <- dt2[, .(set2 = list(set2)), by = id]
re3 <- dt3[dt4, nomatch = 0, on = "id"][, .(ov = list(intersect(unlist(set1), unlist(set2)))), by = id]
})
# 0.21s
Upvotes: 2
Views: 81
Reputation: 66819
I would use a merge:
res <- merge(
dt1[, .(a = list(set.a)), by=id], # <- answer to the first question
dt2[, .(b = list(set.b)), by=id],
by="id"
)[,
overlap := .(Map(intersect, a, b))
]
# id a b overlap
# 1: 1 5,1,3 3,5,9 5,3
# 2: 2 10, 4, 7 8,10, 4 10, 4
If speed is a concern, I suspect this is faster:
dt1[dt2, on = c(id = "id", set.a = "set.b"), nomatch=0][,
.(overlap = list(set.a))
, by=id]
# id overlap
# 1: 1 3,5
# 2: 2 10, 4
dt1[dt2,...]
is just another kind of merge/join.
Upvotes: 1
Reputation: 886978
We can use toString
to paste
the elements in 'set.a', grouped by the 'id' variable.
dt1[, .(set.a=toString(set.a)), by = id]
# id set.a
#1: 1 5, 1, 3
#2: 2 10, 4, 7
For the second case, we rbind
both the datasets to a single one (rbindlist(list(...)
) with idcol=TRUE
. Get the intersect
of 'set.a' for .id
, grouped by the 'id' column and paste
the elements together using toString
.
rbindlist(list(dt1, dt2), idcol=TRUE)[,
.(overlap=toString(intersect(set.a[.id==1], set.a[.id==2]))) , by =id]
# id overlap
#1: 1 5, 3
#2: 2 10, 4
Or as @Richard Scriven mentioned in the comments, we paste
the elements that are duplicated
in the 'set.a' by the 'id' column after we rbind
both the datasets.
rbindlist(list(dt1, dt2))[,
.(set.a = toString(set.a[duplicated(set.a)])), by = id]
Upvotes: 3