Reputation: 7760
Say we have this toy data.table
prueba <- data.table(id=c(1,1,1,1,2,2,3,3,4), kk=c("FA", "N","N","N",NA,"FA","N", "FA", "N"), rrr=1:9)
id kk rrr
1 FA 1
1 N 2
1 N 3
1 N 4
2 NA 5
2 FA 6
3 N 7
3 FA 8
4 N 9
And we want to retrieve all rows belonging to a given "id" if that id contains any "FA" value in the kk column.
I've managed to do it this way:
prueba[id %in% prueba[, any(kk == "FA", na.rm = TRUE), by = id]$id[
                prueba[, any(kk == "FA", na.rm = TRUE), by = id]$V1], ]
id kk rrr
1 FA 1
1 N 2
1 N 3
1 N 4
2 NA 5
2 FA 6
3 N 7
3 FA 8
(We get all rows with id = 1, 2 and 3.)
But I think it's too long and not optimized.
How would you do it easily with data.table?
Upvotes: 1
Views: 1035
Reputation: 7760
I've been trying the different solutions with microbenchmark:
prueba <- data.table(id  = rep(c(1,1,1,1,2,2,3,3,4), 1000000),
                     kk  = rep(c("FA", "N","N","N",NA,"FA","N", "FA", "N"), 1000000),
                     rrr = rep(1:9, 1000000))
prueba[, if(any(kk == "FA")) .SD, by= id] # docendo
prueba[id %in% unique(prueba[kk == "FA", id])] # lmo
prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id,] # eddi
prueba[id %in% prueba[, any(kk == "FA", na.rm = TRUE), by = id]$id[
                prueba[, any(kk == "FA", na.rm = TRUE), by = id]$V1], ]  # skan
prueba %>% group_by(id) %>% filter('FA'%in%kk) # Andrew
prueba[prueba[kk == "FA", .(id)], on="id"] # lmo
min lq mean median uq max name
2.206436 2.211022 2.258038 2.215607 2.283839 2.352071 docendo
1.456590 1.472334 1.596654 1.488077 1.666687 1.845296 lmo
2.767113 2.869260 2.953024 2.971408 3.045980 3.120552 eddi
3.431671 3.437914 3.451760 3.444157 3.461804 3.479451 skan
2.088516 2.247807 2.313196 2.407098 2.425535 2.443973 Andrew
The last solution by lmo doesn't work; it fails with:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.
I expected to see a much bigger difference between the methods; maybe a different dataset would separate them more. The fastest method so far seems to be:
prueba[id %in% unique(prueba[kk == "FA", id])]
I guess there must be better options using .I, .GRP or such functions.
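An `.I`-based option of the kind hinted at above might look like this (my own sketch, not included in the timings):

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# .I holds the global row numbers of each group; subsetting it with the
# length-one logical any(...) keeps all of a group's rows or none of them.
idx <- prueba[, .I[any(kk == "FA", na.rm = TRUE)], by = id]$V1
prueba[idx]
```

This does the grouped test in a single pass and then subsets once, instead of building the id vector twice as in my original attempt.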
Upvotes: 1
Reputation: 451
For the data.table case I would simplify your code to:
prueba <- data.table(id=c(1,1,1,1,2,2,3,3,4), kk=c("FA", "N","N","N",NA,"FA","N", "FA", "N"), rrr=1:9)
prueba[id %in% unique(prueba[kk=="FA",id])]
The output is:
id kk rrr
1: 1 FA 1
2: 1 N 2
3: 1 N 3
4: 1 N 4
5: 2 NA 5
6: 2 FA 6
7: 3 N 7
8: 3 FA 8
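The same idea can also be expressed as a join; deduplicating the matching ids first should avoid the over-allocation error that the plain join ran into on large data (a sketch on the same toy data):

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# unique() collapses the repeated "FA" ids to one row each, so every
# id in i matches its group in x exactly once instead of many times.
res <- prueba[unique(prueba[kk == "FA", .(id)]), on = "id"]
res
```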
Upvotes: 1
Reputation: 3488
I'm not sure about optimization, but here it is cleaned up using dplyr:
library(dplyr)
prueba %>%
  group_by(id) %>%
  filter("FA" %in% kk)
# A tibble: 8 x 3
# Groups: id [3]
id kk rrr
<dbl> <chr> <int>
1 1 FA 1
2 1 N 2
3 1 N 3
4 1 N 4
5 2 <NA> 5
6 2 FA 6
7 3 N 7
8 3 FA 8
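A close variant (my own suggestion, which I'd expect to behave the same) tests each group with `any()` rather than `%in%`:

```r
library(dplyr)

prueba <- data.frame(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# any() with na.rm = TRUE ignores the NA produced by comparing NA to "FA".
out <- prueba %>%
  group_by(id) %>%
  filter(any(kk == "FA", na.rm = TRUE)) %>%
  ungroup()
out
```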
Upvotes: 1