Reputation: 7760
Say we have this toy data.table
prueba <- data.table(id=c(1,1,1,1,2,2,3,3,4), kk=c("FA", "N","N","N",NA,"FA","N", "FA", "N"), rrr=1:9)
id kk rrr
1 FA 1
1 N 2
1 N 3
1 N 4
2 NA 5
2 FA 6
3 N 7
3 FA 8
4 N 9
And we want to retrieve all rows belonging to a given "id" if that id contains any "FA" value in the kk column.
I've managed to do it this way:
prueba[id %in% prueba[, any(kk == "FA", na.rm = TRUE), by = id]$id[
                prueba[, any(kk == "FA", na.rm = TRUE), by = id]$V1], ]
id kk rrr
1 FA 1
1 N 2
1 N 3
1 N 4
2 NA 5
2 FA 6
3 N 7
3 FA 8
(We get all rows with id = 1, 2 and 3.)
But I think it's too long and not optimized.
How would you do it easily with data.table?
Upvotes: 1
Views: 1035
Reputation: 7760
I've been trying the different solutions with microbenchmark:
prueba <- data.table(id  = rep(c(1,1,1,1,2,2,3,3,4), 1000000),
                     kk  = rep(c("FA", "N","N","N",NA,"FA","N", "FA", "N"), 1000000),
                     rrr = rep(1:9, 1000000))
prueba[, if(any(kk == "FA")) .SD, by= id] # docendo
prueba[id %in% unique(prueba[kk == "FA", id])] # lmo
prueba[id %in% prueba[, .I[kk == "FA"], by = id]$id,] # eddi
prueba[id %in% prueba[, any(kk == "FA", na.rm = TRUE), by = id]$id[
                prueba[, any(kk == "FA", na.rm = TRUE), by = id]$V1], ]  # skan
prueba %>% group_by(id) %>% filter('FA'%in%kk) # Andrew
prueba[prueba[kk == "FA", .(id)], on="id"] # lmo
min lq mean median uq max name
2.206436 2.211022 2.258038 2.215607 2.283839 2.352071 docendo
1.456590 1.472334 1.596654 1.488077 1.666687 1.845296 lmo
2.767113 2.869260 2.953024 2.971408 3.045980 3.120552 eddi
3.431671 3.437914 3.451760 3.444157 3.461804 3.479451 skan
2.088516 2.247807 2.313196 2.407098 2.425535 2.443973 Andrew
The last solution by lmo doesn't work; it fails with:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.
I expected to see a much bigger difference between the methods; maybe a different dataset would separate them more. The fastest method so far seems to be:
prueba[id %in% unique(prueba[kk == "FA", id])]
I guess there must be better options using .I, .GRP or such functions.
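An `.I`-based option of the kind hinted at above might look like this (my own sketch, not included in the timings):

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# .I holds the global row numbers of each group; subsetting it with the
# length-one logical any(...) keeps all of a group's rows or none of them.
idx <- prueba[, .I[any(kk == "FA", na.rm = TRUE)], by = id]$V1
prueba[idx]
```

This does the grouped test in a single pass and then subsets once, instead of building the id vector twice as in my original attempt.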
Upvotes: 1
Reputation: 451
For the data.table case I would simplify your code to:
prueba <- data.table(id=c(1,1,1,1,2,2,3,3,4), kk=c("FA", "N","N","N",NA,"FA","N", "FA", "N"), rrr=1:9)
prueba[id %in% unique(prueba[kk=="FA",id])]
The output is:
id kk rrr
1: 1 FA 1
2: 1 N 2
3: 1 N 3
4: 1 N 4
5: 2 NA 5
6: 2 FA 6
7: 3 N 7
8: 3 FA 8
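The same idea can also be expressed as a join; deduplicating the matching ids first should avoid the over-allocation error that the plain join ran into on large data (a sketch on the same toy data):

```r
library(data.table)

prueba <- data.table(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# unique() collapses the repeated "FA" ids to one row each, so every
# id in i matches its group in x exactly once instead of many times.
res <- prueba[unique(prueba[kk == "FA", .(id)]), on = "id"]
res
```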
Upvotes: 1
Reputation: 3488
I'm not sure about optimization, but here it is cleaned up using dplyr:
library(dplyr)
prueba %>%
  group_by(id) %>%
  filter("FA" %in% kk)
# A tibble: 8 x 3
# Groups: id [3]
id kk rrr
<dbl> <chr> <int>
1 1 FA 1
2 1 N 2
3 1 N 3
4 1 N 4
5 2 <NA> 5
6 2 FA 6
7 3 N 7
8 3 FA 8
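A close variant (my own suggestion, which I'd expect to behave the same) tests each group with `any()` rather than `%in%`:

```r
library(dplyr)

prueba <- data.frame(id  = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                     kk  = c("FA", "N", "N", "N", NA, "FA", "N", "FA", "N"),
                     rrr = 1:9)

# any() with na.rm = TRUE ignores the NA produced by comparing NA to "FA".
out <- prueba %>%
  group_by(id) %>%
  filter(any(kk == "FA", na.rm = TRUE)) %>%
  ungroup()
out
```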
Upvotes: 1