Alex

Reputation: 83

Slow data.table subsetting vs dplyr

I have been using dplyr code of the following form

 group_by(dt, ID) %>%
   filter(any(colY == 1 & colX == 10))

to subset a data.table like the following

ID  colX    colY
1111    3   1
1111    2   1
1111    6   0
1111    9   0
2222    10  1
2222    3   1
2222    5   0
2222    7   0
3333    8   1
3333    10  1
3333    3   0
3333    2   0

into

ID  colX    colY
2222    10  1
2222    3   1
2222    5   0
2222    7   0
3333    8   1
3333    10  1
3333    3   0
3333    2   0

Filtering through about 900k rows to get the second table takes about 1.3 seconds.

I have been trying to write an equivalent data.table subset that would be faster, but so far my attempts have only been slower. The following data.table expression

dt[, .SD[any(colY == 1 & colX == 10)], by = ID]

takes ~14 seconds. What seems to be the problem here?
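For anyone who wants to reproduce this, here is a sketch of a toy version of the sample table and the slow subset (the ~14 s timing in the question comes from a ~900k-row table, not this small one):

```r
library(data.table)

# Toy reconstruction of the sample table from the question
dt <- data.table(
  ID   = rep(c(1111, 2222, 3333), each = 4),
  colX = c(3, 2, 6, 9, 10, 3, 5, 7, 8, 10, 3, 2),
  colY = rep(c(1, 1, 0, 0), times = 3)
)

# The slow per-group subset: .SD materialises each group's rows
res <- dt[, .SD[any(colY == 1 & colX == 10)], by = ID]
res  # keeps groups 2222 and 3333 (8 rows)
```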

Upvotes: 4

Views: 880

Answers (1)

Rich Scriven

Reputation: 99331

This might be faster. It avoids the use of .SD and uses the row numbers given in .I instead.

dt[dt[, .I[any(colX == 10 & colY == 1)], by = ID]$V1]
#      ID colX colY
# 1: 2222   10    1
# 2: 2222    3    1
# 3: 2222    5    0
# 4: 2222    7    0
# 5: 3333    8    1
# 6: 3333   10    1
# 7: 3333    3    0
# 8: 3333    2    0

The inner data.table call dt[, .I[any(colX == 10 & colY == 1)], by = ID]$V1 gives us the row numbers of all rows belonging to groups that match the condition. .I holds the original row locations, computed per group. We can see what the result will be by printing our call:

dt[, print(.I[any(colX == 10 & colY == 1)]), by = ID]
# integer(0)
# [1] 5 6 7 8
# [1]  9 10 11 12
# Empty data.table (0 rows) of 1 col: ID

Then we just use that result as a row subset on the original data table.
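Spelled out in two steps on a toy version of the question's data (the variable name idx is mine, not part of the original answer):

```r
library(data.table)

# Toy reconstruction of the sample table from the question
dt <- data.table(
  ID   = rep(c(1111, 2222, 3333), each = 4),
  colX = c(3, 2, 6, 9, 10, 3, 5, 7, 8, 10, 3, 2),
  colY = rep(c(1, 1, 0, 0), times = 3)
)

# Step 1: per group, .I holds the original row numbers; the any()
# condition keeps them only for groups containing a matching row
idx <- dt[, .I[any(colX == 10 & colY == 1)], by = ID]$V1
idx
# [1]  5  6  7  8  9 10 11 12

# Step 2: a plain row subset of the original table -- no per-group
# copy of .SD is built, which is where the speedup comes from
res <- dt[idx]
```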

Upvotes: 5
