Reputation: 383
Is there a way, in pure dplyr or data.table, to group by one column and then reduce each group so that it keeps only the rows whose values in a second column are common to every group?
Here is an example:
library(tibble)
library(dplyr)
classification <- rep(c("A", "B", "C"), each = 10)
set.seed(123)
id.list <- lapply(1:3, function(i) sample(letters[1:7], 10, replace = T))
dtbl <- tibble(classification = classification, id = unlist(id.list), value = round(runif(30, 50, 99)))
This produces the following:
# A tibble: 30 x 3
classification id value
<chr> <chr> <dbl>
1 A g 84
2 A g 89
3 A c 51
4 A f 73
5 A c 87
6 A b 61
7 A b 66
8 A f 61
9 A c 57
10 A e 70
11 B d 70
12 B f 68
13 B f 57
14 B a 57
15 B b 61
16 B c 73
17 B e 63
18 B c 92
19 B c 52
20 B a 72
21 C d 89
22 C a 56
23 C a 77
24 C e 60
25 C c 56
26 C b 87
27 C g 94
28 C b 68
29 C a 83
30 C f 55
The values of id that appear in each classification group are b, c, e and f. I can get the desired result by doing:
common.id <- Reduce(intersect, id.list)
dtbl.intersect <- filter(dtbl, id %in% common.id) %>%
arrange(classification, id, value)
Which gives me:
# A tibble: 20 x 3
classification id value
<chr> <chr> <dbl>
1 A b 61
2 A b 66
3 A c 51
4 A c 57
5 A c 87
6 A e 70
7 A f 61
8 A f 73
9 B b 61
10 B c 52
11 B c 73
12 B c 92
13 B e 63
14 B f 57
15 B f 68
16 C b 68
17 C b 87
18 C c 56
19 C e 60
20 C f 55
But I don't like having to break out of a dplyr pipeline in order to create common.id. Is it possible to perform the entire process in dplyr or data.table?
EDIT
As correctly pointed out in the responses: I'm looking for a solution that does not use the id.list object (that's just part of the MWE). A broader question is: does the dplyr/data.table grouping operation create a list "somewhere" that can be used by a reduce function?
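To make this concrete, here is a rough, purely illustrative sketch of the shape of pipeline I'm hoping for; the split() call is just my stand-in for whatever per-group list the grouping machinery might (or might not) expose internally:
# illustrative only: split(id, classification) stands in for the hypothetical
# per-group list; the rest is the same filter/arrange as above
dtbl %>%
  filter(id %in% Reduce(intersect, split(id, classification))) %>%
  arrange(classification, id, value)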
Upvotes: 1
Views: 1126
Reputation: 25225
Here are two methods using data.table that are comparable in terms of speed:
mtd0 <- function(DT) {
  # start from all ids, then shrink to the intersection group by group
  IDs <- DT[, unique(id)]
  # for each classification, <<- updates IDs in the function environment
  invisible(DT[, IDs <<- intersect(IDs, id), classification])
  DT[id %in% IDs][order(classification, id, value)]
}
mtd1 <- function(DT) {
  # compute the ids common to every classification, then inner-join on them
  DT[DT[, .(id = Reduce(intersect, split(id, classification)))], on = .(id), nomatch = 0L][
    order(classification, id, value)]
}
I think your id.list has confused us into thinking it was available as a standalone variable, although you merely used it to create the MWE.
data:
library(data.table) #data.table_1.12.4
set.seed(0L)
nr <- 1e7
nclass <- 1e4
nid <- 1e2
dat <- data.table(classification=sample(nclass, nr, TRUE),
id=sample(nid, nr, TRUE))[, value := .I]
setorder(dat, classification, id, value)
Timings from bench::mark(mtd0(dat), mtd1(dat)):
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 mtd0(DT) 810.17ms 810.17ms 1.23 897MB 3.70 1 3 810.17ms <df[,3] [6,197,991 x 3]> <df[,3] [80,334 x 3]> <bch:tm> <tibble [1 x 3]>
2 mtd1(DT) 1.33s 1.33s 0.752 854MB 3.01 1 4 1.33s <df[,3] [6,197,991 x 3]> <df[,3] [94,322 x 3]> <bch:tm> <tibble [1 x 3]>
Upvotes: 1
Reputation: 39858
You can do (with the addition of purrr):
dtbl %>%
filter(id %in% reduce(id.list, intersect)) %>%
arrange(classification, id, value)
Or:
dtbl %>%
filter(id %in% Reduce(intersect, id.list)) %>%
arrange(classification, id, value)
Or, if the question is more about deriving the common ids from the data itself rather than from a pre-built id.list:
dtbl %>%
mutate(n = n_distinct(classification)) %>%
group_by(id) %>%
filter(n_distinct(classification) == n) %>%
select(-n) %>%
arrange(classification, id, value)
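As a quick sanity check (a sketch only, assuming the question's dtbl and id.list are still in scope), the ids kept by this last pipeline should match the OP's Reduce(intersect, ...) result:
# ids retained by the n_distinct() approach ...
kept_ids <- dtbl %>%
  group_by(id) %>%
  filter(n_distinct(classification) == n_distinct(dtbl$classification)) %>%
  pull(id) %>%
  unique() %>%
  sort()
# ... should equal the intersection computed from id.list (expected: TRUE)
identical(kept_ids, sort(Reduce(intersect, id.list)))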
Upvotes: 1
Reputation: 33488
Here is what I would do:
setDT(dtbl)
result <- dtbl[id %chin% Reduce(intersect, id.list)] # %chin% is fast %in% for characters
setorder(result, classification, id, value)
Upvotes: 0
Reputation: 6489
You could use the data.table package as follows:
setDT(dtbl)
dtbl[, .SD[id %in% Reduce(intersect, id.list)], by = classification
][order(classification, id)]
Upvotes: 3