Reputation: 5646
I would like to build a structure that, for each record, stores a string, an index and a numeric value, and I would like to access the numeric value by querying the data structure with either the index or the string. The data structure is small (on the order of 30 records), but it must be accessed and modified many times (possibly even a million times). Normally I would just use a data frame, but given the efficiency requirements, do you think there would be a better (faster) way? Judging by the syntax, I have the impression that my_struct needs to be accessed twice for each operation (read or write), as spelled out after the code below: maybe it's not a big deal, but I wonder whether expert R coders, when efficiency is a constraint, would use this code or something different.
# define data structure
my_struct <- data.frame(
  index    = c(3:14, 24),
  variable = c("Pin", "Pout", "Tout", "D", "L", "mu", "R", "K", "c",
               "omega", "alpha", "beta", "gamma"),
  value    = runif(13),
  stringsAsFactors = FALSE
)
# examples of read/write statements
# read by the string key
my_struct$value[my_struct$variable == "Pin"]
# read by a set of indices
my_struct$value[my_struct$index %in% 3:14]
# write by index
my_struct$value[my_struct$index %in% c(3, 5)] <- rnorm(2)
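To spell out the two accesses (a restatement of the first read above, not new code):
# access 1: scan the key column to build a logical mask
mask <- my_struct$variable == "Pin"
# access 2: use the mask to subset the value column
my_struct$value[mask]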
Upvotes: 0
Views: 95
Reputation: 66819
The data.table package supports indices and has nice syntax for read and write:
library(data.table)
dat <- data.table(
  index    = c(3:14, 24),
  variable = c("Pin", "Pout", "Tout", "D", "L", "mu", "R", "K", "c",
               "omega", "alpha", "beta", "gamma"),
  value    = runif(13)
)
setindex(dat, index)
setindex(dat, variable)
# read
dat[ index %in% 3:4, value ]
# write
dat[ index %in% 3:4, value := 2:3 ]
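The same pattern covers the string lookups from the question (my addition, for symmetry with the index examples above):
# read by the string key
dat[ variable == "Pin", value ]
# write by the string key
dat[ variable == "Pin", value := 0 ]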
To see how the index works, add verbose = TRUE, as in dat[ index %in% 3:4, value := 2:3, verbose = TRUE ], and read the vignettes. (Indices are covered in the fourth one.)
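As a quick check (an aside, not in the original answer), data.table's indices() reports which indices are set on a table:
# list the indices created by the two setindex() calls above
indices(dat)
# expect something like: "index" "variable"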
Benchmark for OP's example
library(microbenchmark)
datDF = data.frame(dat)   # plain data.frame copy for comparison
n_idx = 2L
idxcol = "variable"
idx = sample(dat[[idxcol]], n_idx)
v = rnorm(length(idx))
# build the unevaluated expression `variable %in% idx` for DT's i argument
e = substitute(idxcol %in% idx, list(idxcol = as.name(idxcol)))
microbenchmark(
  DT = dat[eval(e), value := v ],
  DF = datDF$value[ datDF[[idxcol]] %in% idx ] <- v
)
# Unit: microseconds
# expr min lq mean median uq max neval
# DT 449.694 473.136 487.17583 481.042 487.0065 1049.193 100
# DF 27.742 30.239 44.21525 36.065 38.4225 854.723 100
So for this small table, data.table is actually slower. I'd still go with it for the (in my opinion) nicer syntax. Note that dplyr has no syntax for updating a subset of rows in place; the closest equivalent rebuilds the whole column, as sketched below.
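A sketch of that dplyr workaround (my addition, not from the original answer; it assumes dplyr is loaded and the datDF, idxcol, idx and v objects defined above):
library(dplyr)
# rebuild the entire value column: replace() writes v into the matched
# positions, and mutate() copies the new column back into the data frame
datDF <- datDF %>%
  mutate(value = replace(value, match(idx, .data[[idxcol]]), v))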
With a large table, the index pays off (the indexed lookup avoids scanning the whole column) and you'd see the benchmark reversed:
# build a large table: 26^4 = 456,976 four-letter keys
dat = data.table(variable = do.call(paste0, CJ(LETTERS, LETTERS, LETTERS, LETTERS)))
dat[, index := .I ]
dat[, value := rnorm(.N) ]
setindex(dat, index)
setindex(dat, variable)
# same setup as before, now against the large table
datDF = data.frame(dat)
n_idx = 2L
idxcol = "variable"
idx = sample(dat[[idxcol]], n_idx)
v = rnorm(length(idx))
e = substitute(idxcol %in% idx, list(idxcol = as.name(idxcol)))
microbenchmark(
  DT = dat[eval(e), value := v ],
  DF = datDF$value[ datDF[[idxcol]] %in% idx ] <- v
)
# Unit: microseconds
# expr min lq mean median uq max neval
# DT 471.887 492.5545 701.7914 757.766 817.827 1647.582 100
# DF 17387.134 17729.3280 23750.6721 22629.490 25912.309 83057.928 100
Note: the DF way can also be written datDF$value[ match(idx, datDF[[idxcol]]) ] <- v, but I'm seeing about the same timing.
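On the DT side there is also a join-based form (my addition; it hardcodes the column name, since idxcol is "variable" in the benchmarks above) that looks the keys up directly instead of building the %in% expression:
# update on join: match idx against the indexed 'variable' column,
# assigning the corresponding elements of v to the matched rows
dat[.(idx), value := v, on = "variable"]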
Upvotes: 2