DeltaIV

Reputation: 5646

Efficient data structure to store a string, an integer and a real number for each record

I would like to build a structure which, for each record, stores a string, an integer index and a numeric value. I would like to be able to access the numeric value by querying the data structure with either the index or the string. The data structure is small (on the order of 30 records), but it must be accessed and modified many times (possibly even a million times). Normally I would just use a data frame, but given the efficiency requirements, is there a better (faster) way? Judging by the syntax, I have the impression that my_struct is accessed twice for each operation (read or write): maybe it's not a big deal, but I wonder whether expert R coders, when efficiency is a constraint, would use this code or something different.

# define data structure
my_struct <- data.frame(
  index    = c(3:14, 24),
  variable = c("Pin", "Pout", "Tout", "D", "L", "mu", "R",
               "K", "c", "omega", "alpha", "beta", "gamma"),
  value    = runif(13),
  stringsAsFactors = FALSE
)

# examples of read/write statements
my_struct$value[my_struct$variable == "Pin"]               # read by name
my_struct$value[my_struct$index %in% c(3:14)]              # read by index
my_struct$value[my_struct$index %in% c(3,5)] <- rnorm(2)   # write by index
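
One lighter-weight alternative I considered is a named numeric vector plus a small index-to-name lookup. This is only a sketch (name_by_index is an illustrative helper, not something I have benchmarked):

# a named vector keyed by variable name
vals <- setNames(runif(13),
                 c("Pin", "Pout", "Tout", "D", "L", "mu", "R",
                   "K", "c", "omega", "alpha", "beta", "gamma"))
# illustrative helper: map the integer index to the variable name
name_by_index <- setNames(names(vals), c(3:14, 24))

vals["Pin"]                                              # read by name
vals[name_by_index[as.character(c(3, 5))]] <- rnorm(2)   # write by index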

Upvotes: 0

Views: 95

Answers (1)

Frank

Reputation: 66819

The data.table package supports indices and has nice syntax for read and write:

library(data.table)
dat <- data.table(
  index    = c(3:14, 24),
  variable = c("Pin", "Pout", "Tout", "D", "L", "mu", "R",
               "K", "c", "omega", "alpha", "beta", "gamma"),
  value    = runif(13)
)

# create secondary indices so either column can be used for fast lookup
setindex(dat, index)
setindex(dat, variable)

# read
dat[ index %in% 3:4, value ]

# write
dat[ index %in% 3:4, value := 2:3 ]

To see how the index works, add verbose = TRUE:

dat[ index %in% 3:4, value := 2:3, verbose = TRUE ]

Also read the vignettes; indices are covered in the fourth one.

Benchmark for OP's example

library(microbenchmark)
datDF = data.frame(dat)   # plain data.frame copy for comparison

n_idx  = 2L
idxcol = "variable"
idx    = sample(dat[[idxcol]], n_idx)   # values to look up
v      = rnorm(length(idx))             # replacement values
# build the expression `variable %in% idx`, parameterizing the lookup column
e      = substitute(idxcol %in% idx, list(idxcol = as.name(idxcol)))
microbenchmark(
  DT  = dat[eval(e), value := v ],
  DF  = datDF$value[ datDF[[idxcol]] %in% idx ] <- v
)

# Unit: microseconds
#  expr     min      lq      mean  median       uq      max neval
#    DT 449.694 473.136 487.17583 481.042 487.0065 1049.193   100
#    DF  27.742  30.239  44.21525  36.065  38.4225  854.723   100

So for a table this small, data.table is actually slower; the fixed overhead of its [ method dominates at 13 rows. I'd still go with it for the (in my opinion) nicer syntax. Note that dplyr has no syntax for updating a subset of rows in place; see the sketch below.
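
For comparison, the closest dplyr idiom rebuilds the whole value column instead of assigning to a subset. This is only a sketch, reusing idx and v from the benchmark above:

library(dplyr)
# dplyr has no in-place subset assignment; replace() rebuilds the column
datDF <- mutate(datDF, value = replace(value, variable %in% idx, v))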

With a large table, you'd see the benchmark reversed:

# 26^4 = 456,976 rows with unique four-letter names
dat = data.table(variable = do.call(paste0, CJ(LETTERS, LETTERS, LETTERS, LETTERS)))
dat[, index := .I ]          # row number as the integer index
dat[, value := rnorm(.N) ]
setindex(dat, index)
setindex(dat, variable)

datDF = data.frame(dat)

n_idx  = 2L
idxcol = "variable"
idx    = sample(dat[[idxcol]], n_idx)
v      = rnorm(length(idx))
e      = substitute(idxcol %in% idx, list(idxcol = as.name(idxcol)))
microbenchmark(
  DT = dat[eval(e), value := v ],
  DF = datDF$value[ datDF[[idxcol]] %in% idx ] <- v
)

# Unit: microseconds
#  expr       min         lq       mean    median        uq       max neval
#    DT   471.887   492.5545   701.7914   757.766   817.827  1647.582   100
#    DF 17387.134 17729.3280 23750.6721 22629.490 25912.309 83057.928   100

Note: The DF way can also be written datDF$value[ match(idx, datDF[[idxcol]]) ] <- v, but I'm seeing about the same timing.

Upvotes: 2
