Reputation: 83
I have been using the following format of dplyr code
group_by(dt, ID) %>%
filter(any(colY == 1 & colX == 10))
To subset a data.table like the following
ID colX colY 1111 3 1 1111 2 1 1111 6 0 1111 9 0 2222 10 1 2222 3 1 2222 5 0 2222 7 0 3333 8 1 3333 10 1 3333 3 0 3333 2 0
into
ID colX colY 2222 10 1 2222 3 1 2222 5 0 2222 7 0 3333 8 1 3333 10 1 3333 3 0 3333 2 0
Filtering through about 900k rows to get the second table takes about 1.3 seconds.
I have been trying to implement a data.table subset that would be faster, but so far the results have only taken longer. Using the following data.table subset
dt[ , .SD[any( (colY == 1 & colX == 10)) ], ID]
Takes ~14 seconds. What seems to be the problem here?
Upvotes: 4
Views: 880
Reputation: 99331
This might be faster. It avoids the use of .SD
and uses the row numbers given in .I
instead.
dt[dt[, .I[any(colX == 10 & colY == 1)], by = ID]$V1]
# ID colX colY
# 1: 2222 10 1
# 2: 2222 3 1
# 3: 2222 5 0
# 4: 2222 7 0
# 5: 3333 8 1
# 6: 3333 10 1
# 7: 3333 3 0
# 8: 3333 2 0
The inner data table call dt[, .I[any(colX == 10 & colY == 1)], by = ID]$V1
is giving us the row numbers for those groups that match our condition. .I
gives us the row location per group. We can see what the result will be by printing our call:
dt[, print(.I[any(colX == 10 & colY == 1)]), by = ID]
# integer(0)
# [1] 5 6 7 8
# [1] 9 10 11 12
# Empty data.table (0 rows) of 1 col: ID
Then we just use that result as a row subset on the original data table.
Upvotes: 5