Nakul Upadhya
Nakul Upadhya

Reputation: 488

What algorithm does R use to search through data frames?

What algorithm does the which() method in native R use to search through data frames? For example, If I call

df[which(df$parameter == 3)]

What algorithm does R use to search through this data frame? Binary search? Linear Search? I can't find documentation on this anywhere. If it doesn't use any of the 2 above, what algorithm does it use and whats its time-complexity?

Upvotes: 4

Views: 495

Answers (1)

linog
linog

Reputation: 6226

If you are interested in that question, you can have a look to the datatable vignette on the subject

By default, R uses a linear search. It scans the whole vector. This is one of the great feature of data.table to enable binary search with indexed table. In data.table, an indexed dataset will have elements ordered physically in memory so searching values into it is very fast. Secondary indices in data.table(see here) will enable binary search but data will not be reordered physically in contiguous memory slots.

Maybe the data.table vignette is more explicit:

It can be seen that with every search we reduce the number of searches by half. This is why binary search based subsets are incredibly fast. Since rows of each column of data.tables have contiguous locations in memory, the operations are performed in a very cache efficient manner (also contributes to speed).

In addition, since we obtain the matching row indices directly without having to create those huge logical vectors (equal to the number of rows in a data.table), it is quite memory efficient as well.

Example

Let's see with an example. The package microbenchmark is used for comparisons.

We will compare:

  • (Linear) search on a data.frame
  • Filter with dplyr
  • (Linear) search on a data.table
  • (Binary) search on a data.table with primary keys (setkey is used)
  • (Binary) search on a data.table with secondary keys (setindex is used)

Let's create every pieces needed:

library(data.table)
library(dplyr)
df <- data.frame(
  x = rnorm(1e6),
  y = rnorm(1e6)
)
dt <- as.data.table(df) 

dt2 <- copy(dt)
dt3 <- copy(dt)

We will set primary key on dt2 (binary search + object reordered in memory) and secondary key on dt3 (binary search but no reordering in memory)

setkey(dt2, x)
setindex(dt3, x)

The microbenchmark:

m <- microbenchmark::microbenchmark(
  df[df$x<0,],
  df %>% dplyr::filter(x<0),
  dt[x<0],
  dt2[x<0],
  dt3[x<0],
  times = 20L
)

m 

Unit: milliseconds
                        expr      min       lq     mean   median       uq       max neval
              df[df$x < 0, ] 56.56557 57.54392 66.97838 61.39609 75.30391  91.42418    20
 df %>% dplyr::filter(x < 0) 23.24242 24.15183 34.64290 26.02232 34.91405 143.10476    20
                   dt[x < 0] 18.32496 18.96585 21.35255 20.25604 23.02666  33.25656    20
                  dt2[x < 0] 10.85129 10.94804 11.92941 11.21601 11.80469  18.29040    20
                  dt3[x < 0] 18.37789 18.47568 19.51928 18.76135 19.39782  26.90826    20

The base R approach is the slowest, by far. In this example, using secondary indexes does not improve that much performance. But primary keys do !

ggplot2::autoplot(m)

enter image description here

Upvotes: 3

Related Questions