SOUser

Reputation: 610

best key type for R data.table

Are integers or shorter strings faster as keys in a data.table? For example,

 dt1 = data.table(x = c("a","b","c","d","e"), y= c(1,2,3,4,5))

vs

 dt2 = data.table(x = c("ndjdnjndjndddjhjdhdhdbdjbjhfbdfbdfjhdbfd", "jnjwnjdndsjdsndjskndskjdnsdjsndskdnsk","jnjnsjncsccdjhcbdhjcbdcjhd","sjdnjdncjdncdcdcdccndcd","wjdndjnjcndcjdncdc"), y= c(1,2,3,4,5))

Would x in dt1 make a better/faster key than the longer strings in dt2$x? Put another way, how does string length affect speed?
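A minimal way to time this yourself (a sketch, not authoritative: it assumes the microbenchmark package, and strrep() is just used to build longer versions of the same keys):

```r
library(data.table)
library(microbenchmark)

# Same data, short vs. long keys (strrep() repeats each 1-char key 8 times)
dt1 <- data.table(x = c("a","b","c","d","e"), y = 1:5)
dt2 <- data.table(x = strrep(c("a","b","c","d","e"), 8), y = 1:5)
setkey(dt1, x)
setkey(dt2, x)

# Keyed lookup with a 1-character vs. an 8-character key
microbenchmark(dt1["c"], dt2[strrep("c", 8)])
```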

Thanks!

Upvotes: 1

Views: 83

Answers (1)

KenHBS

Reputation: 7164

I compared the performance of data.table objects with keys of different string lengths across three data.table operations:

  1. creating a data.table
  2. setting keys for the data.table
  3. accessing rows in a data.table

Code

library(data.table)
library(random)
library(microbenchmark)

sizes <- c(2, 5, 10, 20)  # Lengths of the strings we'll use as keys in the data.tables

# Generate random strings of different lengths
# (randomStrings() from the 'random' package fetches random strings from random.org):
randomstrings <- function(size){
  randomStrings(n = 100, len = size, upperalpha = FALSE, digits = FALSE, check = FALSE)
}
keys <- lapply(sizes, randomstrings)  # The differently sized keys we'll use

# Create a data.table (randomStrings() returns a one-column matrix,
# so the key column ends up named x.V1):
dt <- function(keys){data.table(x = keys, y = 1:100)}

# Choose 5 keys at random (used to access rows in the benchmarking):
some5keys <- function(datatable){datatable[sample(datatable$x.V1, 5)]}

### BENCHMARKING ###
# Creating the data.tables:
(creationbench <- microbenchmark(dt1 <- dt(keys[[1]]), 
                                 dt2 <- dt(keys[[2]]), 
                                 dt3 <- dt(keys[[3]]), 
                                 dt4 <- dt(keys[[4]])))
# Unit: microseconds
# expr                 min     lq       mean     median   uq       max      neval
# dt1 <- dt(keys[[1]]) 562.926 609.1035 714.7314 672.5955 803.7075 1117.683   100
# dt2 <- dt(keys[[2]]) 565.636 605.7725 737.8285 661.0125 756.9390 5087.124   100
# dt3 <- dt(keys[[3]]) 563.347 606.8465 694.8140 631.6945 754.4420 1326.753   100
# dt4 <- dt(keys[[4]]) 578.101 622.4180 722.8112 708.4055 785.9755 1509.439   100

# Setting the keys for the data.tables:
(setkeybench <- (microbenchmark(setkey(dt1, x.V1), 
                                setkey(dt2, x.V1), 
                                setkey(dt3, x.V1), 
                                setkey(dt4, x.V1))))
# Unit: microseconds
# expr              min    lq      mean     median  uq      max       neval
# setkey(dt1, x.V1) 76.401 77.9530 82.28644 78.7440 81.3955 111.267   100
# setkey(dt2, x.V1) 75.620 77.7395 91.95130 79.6885 90.6075 343.743   100
# setkey(dt3, x.V1) 76.330 77.7900 84.21696 78.6290 83.8310 189.792   100
# setkey(dt4, x.V1) 76.044 77.8135 85.35959 79.1675 89.8920 129.458   100

# Accessing rows in the data.tables:
(selectbench <- (microbenchmark(some5keys(dt1), 
                           some5keys(dt2),
                           some5keys(dt3),
                           some5keys(dt4))))
# Unit: microseconds
# expr           min     lq       mean     median   uq       max      neval
# some5keys(dt1) 958.961 1029.778 1244.538 1131.350 1318.147 5389.407   100
# some5keys(dt2) 968.710 1037.023 1246.963 1131.209 1302.656 5890.560   100
# some5keys(dt3) 966.647 1025.569 1206.210 1140.247 1299.570 2221.324   100
# some5keys(dt4) 960.804 1042.528 1218.077 1171.347 1363.010 1813.551   100

It looks like the length of the key string has no measurable influence on the speed of these data.table operations, at least at these sizes.
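One case where length plausibly could start to matter (my assumption, not something the benchmarks above test): sorting keys that share a long common prefix, since lexicographic comparison has to scan past the shared part. A contrived sketch with base R's order(), which is only a stand-in here (data.table uses its own radix sort internally):

```r
set.seed(42)
n <- 1e4
short_keys <- sprintf("%05d", sample(n))             # e.g. "00042"
long_keys  <- paste0(strrep("x", 200), short_keys)   # 200-char shared prefix

# Compare sorting cost for the two key sets:
system.time(order(short_keys))
system.time(order(long_keys))
```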

Note that there are probably other data.table operations you would also want to compare.
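For instance, grouped aggregation and keyed joins could be benchmarked the same way. A self-contained sketch (the table names short/long and the key format are mine, not from the benchmarks above):

```r
library(data.table)
library(microbenchmark)

# 100 unique short keys and 20x-longer counterparts, same y values
short <- data.table(x.V1 = sprintf("k%03d", 1:100), y = 1:100)
long  <- data.table(x.V1 = strrep(sprintf("k%03d", 1:100), 20), y = 1:100)
setkey(short, x.V1)
setkey(long, x.V1)

# Grouped aggregation and a keyed subset, short vs. long keys:
microbenchmark(
  short[, sum(y), by = x.V1],
  long[,  sum(y), by = x.V1],
  short[short[1:5, x.V1]],
  long[long[1:5, x.V1]]
)
```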

Upvotes: 1
