Reputation: 488
What algorithm does the which() method in native R use to search through data frames? For example, If I call
df[which(df$parameter == 3)]
What algorithm does R use to search through this data frame? Binary search? Linear Search? I can't find documentation on this anywhere. If it doesn't use any of the 2 above, what algorithm does it use and whats its time-complexity?
Upvotes: 4
Views: 495
Reputation: 6226
If you are interested in that question, you can have a look to the datatable vignette on the subject
By default, R
uses a linear search. It scans the whole vector. This is one of the great feature of data.table
to enable binary search
with indexed table. In data.table
, an indexed dataset will have elements ordered physically in memory so searching values into it is very fast. Secondary indices in data.table
(see here) will enable binary search but data will not be reordered physically in contiguous memory slots.
Maybe the data.table
vignette is more explicit:
It can be seen that with every search we reduce the number of searches by half. This is why binary search based subsets are incredibly fast. Since rows of each column of data.tables have contiguous locations in memory, the operations are performed in a very cache efficient manner (also contributes to speed).
In addition, since we obtain the matching row indices directly without having to create those huge logical vectors (equal to the number of rows in a data.table), it is quite memory efficient as well.
Let's see with an example. The package microbenchmark
is used for comparisons.
We will compare:
data.frame
dplyr
data.table
data.table
with primary keys (setkey
is used)data.table
with secondary keys (setindex
is used)Let's create every pieces needed:
library(data.table)
library(dplyr)
df <- data.frame(
x = rnorm(1e6),
y = rnorm(1e6)
)
dt <- as.data.table(df)
dt2 <- copy(dt)
dt3 <- copy(dt)
We will set primary key on dt2
(binary search + object reordered in memory) and secondary key on dt3
(binary search but no reordering in memory)
setkey(dt2, x)
setindex(dt3, x)
The microbenchmark:
m <- microbenchmark::microbenchmark(
df[df$x<0,],
df %>% dplyr::filter(x<0),
dt[x<0],
dt2[x<0],
dt3[x<0],
times = 20L
)
m
Unit: milliseconds
expr min lq mean median uq max neval
df[df$x < 0, ] 56.56557 57.54392 66.97838 61.39609 75.30391 91.42418 20
df %>% dplyr::filter(x < 0) 23.24242 24.15183 34.64290 26.02232 34.91405 143.10476 20
dt[x < 0] 18.32496 18.96585 21.35255 20.25604 23.02666 33.25656 20
dt2[x < 0] 10.85129 10.94804 11.92941 11.21601 11.80469 18.29040 20
dt3[x < 0] 18.37789 18.47568 19.51928 18.76135 19.39782 26.90826 20
The base R approach is the slowest, by far. In this example, using secondary indexes does not improve that much performance. But primary keys do !
ggplot2::autoplot(m)
Upvotes: 3