Reputation: 660
Is there a fast way to get the row indices of a value from a data.table? I have set a column as the key, but I am struggling to find an efficient way to get the matching indices.
library(data.table)
x <- sample(letters, 200, replace = TRUE)
y <- rnorm(200)
DT <- data.table(x, y, key = "x")
df <- data.frame(x, y)
Execution time:
system.time(for(i in 1:1000) DT[.("g"), which= TRUE]) # 0.3 sec
system.time(for(i in 1:1000) which(DT$x == "g")) # 0.004 sec
system.time(for(i in 1:1000) which(df$x == "g")) # 0.004 sec
I guess the last two calls do not use the key to find the indices, yet they are much faster than the keyed lookup. Is there a fast way that uses the key?
Upvotes: 2
Views: 237
Reputation: 11255
You seem to be 1) running into the time it takes to call [.data.table and 2) likely paying a lot of overhead to set up the join operation for only 200 rows. Going up to 2,000,000 rows, DT[.("g"), which = TRUE] becomes much faster than the scan (about 1.1 ms median versus 24.9 ms for which(DT$x == "g") in the benchmarks below).
library(data.table)
x <- sample(letters, 200, replace = TRUE)
y <- rnorm(200)
DT <- data.table(x, y, key = "x")
bench::mark(which(DT$x == "g"),
DT[.("g"), which = TRUE])
## 200 rows:
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt>
## 1 which(DT$x == "g") 7.9us 11.2us 88385. 1.66KB
## 2 DT[.("g"), which = TRUE] 735.8us 905.8us 1010. 64.73KB
## 20,000 rows:
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch> <bch:> <dbl> <bch:byt>
## 1 which(DT$x == "g") 251us 265us 3654. 159.5KB
## 2 DT[.("g"), which = TRUE] 744us 907us 879. 67.8KB
## 2,000,000 rows:
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:t> <bch:> <dbl> <bch:byt>
## 1 which(DT$x == "g") 21900us 24.9ms 40.6 15.6MB
## 2 DT[.("g"), which = TRUE] 868us 1.1ms 724. 366.1KB
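The answer shows code only for the 200-row case. A minimal sketch of how the larger cases could be reproduced follows; the helper make_dt() and the repetition count are assumptions for illustration, not part of the original post, and it uses system.time() as in the question rather than bench::mark(). Timings will vary by machine.

```r
library(data.table)

# Hypothetical helper (not from the original post): build a keyed
# data.table with n rows, mirroring the setup used in the question.
make_dt <- function(n) {
  x <- sample(letters, n, replace = TRUE)
  y <- rnorm(n)
  data.table(x, y, key = "x")
}

set.seed(1)
reps <- 20  # assumed repetition count, kept small so the 2,000,000-row case stays quick

for (n in c(200, 20000, 2000000)) {
  DT <- make_dt(n)

  # Both approaches should return the same row indices.
  stopifnot(isTRUE(all.equal(which(DT$x == "g"), DT[.("g"), which = TRUE])))

  # Vector scan: compares every element of the column to "g".
  t_scan <- system.time(for (i in seq_len(reps)) which(DT$x == "g"))[["elapsed"]]

  # Keyed lookup: a join on the key column, with fixed per-call overhead.
  t_key <- system.time(for (i in seq_len(reps)) DT[.("g"), which = TRUE])[["elapsed"]]

  cat(sprintf("n = %d: scan %.4f s, keyed %.4f s (%d reps)\n", n, t_scan, t_key, reps))
}
```

For small tables the fixed cost of each [.data.table call dominates, so the plain which() scan is expected to win; the keyed lookup should pull ahead only as the number of rows grows.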
Upvotes: 4