Dave_L
Dave_L

Reputation: 383

reduce grouped data frame by intersecting values in each group

Is there a way in pure dplyr or data.table to group by one column and then reduce each group so that they end up only containing the rows where the values of another column intersect?

Here is an example:

library(tibble)
library(dplyr)
classification <- rep(c("A", "B", "C"), each = 10)
set.seed(123)
id.list <- lapply(1:3, function(i) sample(letters[1:7], 10, replace = T))
dtbl <- tibble(classification = classification, id = unlist(id.list), value = round(runif(30, 50, 99)))

This produces the following:

# A tibble: 30 x 3
   classification id    value
   <chr>          <chr> <dbl>
 1 A              g        84
 2 A              g        89
 3 A              c        51
 4 A              f        73
 5 A              c        87
 6 A              b        61
 7 A              b        66
 8 A              f        61
 9 A              c        57
10 A              e        70
11 B              d        70
12 B              f        68
13 B              f        57
14 B              a        57
15 B              b        61
16 B              c        73
17 B              e        63
18 B              c        92
19 B              c        52
20 B              a        72
21 C              d        89
22 C              a        56
23 C              a        77
24 C              e        60
25 C              c        56
26 C              b        87
27 C              g        94
28 C              b        68
29 C              a        83
30 C              f        55

The values of id that appear in each classification group are b, c, e and f. I can get the desired result by doing:

common.id <- Reduce(intersect, id.list)
dtbl.intersect <- filter(dtbl, id %in% common.id) %>% 
  arrange(classification, id, value)

Which gives me:

# A tibble: 20 x 3
   classification id    value
   <chr>          <chr> <dbl>
 1 A              b        61
 2 A              b        66
 3 A              c        51
 4 A              c        57
 5 A              c        87
 6 A              e        70
 7 A              f        61
 8 A              f        73
 9 B              b        61
10 B              c        52
11 B              c        73
12 B              c        92
13 B              e        63
14 B              f        57
15 B              f        68
16 C              b        68
17 C              b        87
18 C              c        56
19 C              e        60
20 C              f        55

But I don't like having to break out of a dplyr pipeline in order to create common.id. Is it possible to perform the entire process in dplyr or data.table?

EDIT As correctly pointed out in the responses: I'm looking for a solution that does not use the id.list list (that's just part of the MWE). A broader question is: does the dplyr/data.table grouping operation create a list "somewhere" that can be used by a reduce function?

Upvotes: 1

Views: 1126

Answers (4)

chinsoon12
chinsoon12

Reputation: 25225

Here are 2 comparable (in terms of speed) methods using data.table:

mtd0 <- function(DT) {
    IDs <- DT[, unique(id)]
    invisible(DT[, IDs <<- intersect(IDs, id), classification])
    DT[id %in% IDs][order(classification, id, value)]
}

mtd1 <- function(DT) {
    DT[DT[, .(id=Reduce(intersect, split(id, classification)))], on=.(id), nomatch=0L][
        order(classification, id, value)]
}

I think your id.list has confused us, making us think that it was available as a standalone variable although you merely used it to create a MWE.

data:

library(data.table) #data.table_1.12.4
set.seed(0L)
nr <- 1e7
nclass <- 1e4
nid <- 1e2
dat <- data.table(classification=sample(nclass, nr, TRUE),
    id=sample(nid, nr, TRUE))[, value := .I]
setorder(dat, classification, id, value)

timings from bench::mark(mtd0(dat), mtd1(dat)):

# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                   memory                time     gc              
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                   <list>                <list>   <list>          
1 mtd0(DT)   810.17ms 810.17ms     1.23      897MB     3.70     1     3   810.17ms <df[,3] [6,197,991 x 3]> <df[,3] [80,334 x 3]> <bch:tm> <tibble [1 x 3]>
2 mtd1(DT)      1.33s    1.33s     0.752     854MB     3.01     1     4      1.33s <df[,3] [6,197,991 x 3]> <df[,3] [94,322 x 3]> <bch:tm> <tibble [1 x 3]>

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 39858

You can do (with the addition of purrr):

dtbl %>%
 filter(id %in% reduce(id.list, intersect)) %>%
 arrange(classification, id, value)

Or:

dtbl %>%
 filter(id %in% Reduce(intersect, id.list)) %>%
 arrange(classification, id, value)

Or if the question is more focused on creating id.list:

dtbl %>%
 mutate(n = n_distinct(classification)) %>%
 group_by(id) %>%
 filter(n_distinct(classification) == n) %>%
 select(-n) %>%
 arrange(classification, id, value)

Upvotes: 1

s_baldur
s_baldur

Reputation: 33488

Here is what I would do:

setDT(dtbl)
result <- dtbl[id %chin% Reduce(intersect, id.list)] # %chin% is fast %in% for characters
setorder(result, classification, id, value)

Upvotes: 0

B. Christian Kamgang
B. Christian Kamgang

Reputation: 6489

You could use data.table package as follows:

  setDT(dtbl)
  dtbl[, .SD[id %in% Reduce(intersect, id.list)], by = classification
       ][order(classification, id)]

Upvotes: 3

Related Questions