Reputation: 45
How does one efficiently nest deep hashes with variables in R? For example:
hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)
foo <- 'food'
bar <- 'fruits'
var <- 'apple'
count <- 1
# this will work... but it's only one level in
hash[[foo]] <- count
# A deeper nest is needed, but does not work:
hash[[foo]][[bar]][[var]] <- count
# this gets closer, but foo and bar need to be evaluated as their variables
hash$foo$bar$var <- count
# examine the keys
ls(hash)
I saw the answer from the post here that this might not be available in R. Is this true? It appears with the $ assignment, we can go in a bit further, but variables are needed for the question here.
I saw that using the environment's hash capabilities are faster than using some packages and it might be best to learn the fastest way possible, but if a package is needed to do the job, then I guess it must be used.
Upvotes: 1
Views: 555
Reputation: 46876
Here's a list-of-lists that has the properties that you'd like, in terms of adding arbitrary levels with no need to 'pre-declare'.
l = list()
l[[foo]][[bar]][[var]] = 2
After the fact, one could transform this from a list-of-lists to nested environments (this would allow the structure to be passed to a function and leaf nodes updated, for instance, without having to return the structure) with something like
as_environment = function(x) {
if (is.list(x)) {
x <- lapply(x, as_environment)
x <- as.environment(x)
}
x
}
e = as_environment(l)
This shows that it is possible to have nested hashes in R.
Here's some data -- nested levels with 50, 100, or 1000 possible values, and 10000 total data points
m = c(50, 100, 1000)
n = 1000 * 10
d = list(
a = sample(as.character(seq_len(m[[1]])), n, TRUE),
b = sample(as.character(seq_len(m[[2]])), n, TRUE),
c = sample(as.character(seq_len(m[[3]])), n, TRUE)
)
Here are some functions to measure performance
f0 = function(a, b, c, n) {
## data access fixed cost
for (i in seq_len(n))
c(a[[i]], b[[i]], c[[i]])
}
f1 = function(x, a, b, c, n) {
## creation / assignment
for (i in seq_len(n))
x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] <- 1
x
}
f2 = function(x, a, b, c, n) {
## update
for (i in seq_len(n))
x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] <-
x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] + 1
x
}
Here's some benchmark data
library(microbenchmark)
l <- with(d, f1(list(), a, b, c, n))
e <- as_environment(l)
microbenchmark(
with(d, f0(a, b, c, n)),
with(d, f1(list(), a, b, c, n)),
with(d, f2(l, a, b, c, n)),
with(d, f2(e, a, b, c, n)),
times = 10
)
with output...
Unit: milliseconds
expr min lq mean median uq max neval
with(d, f0(a, b, c, n)) 16.59220 17.37859 19.36920 18.02578 20.46342 25.21631 10
with(d, f1(list(), a, b, c, n)) 72.54094 74.24085 83.54071 81.90286 90.75257 98.03838 10
with(d, f2(l, a, b, c, n)) 86.65550 96.49548 104.69007 101.74540 116.04673 135.76844 10
with(d, f2(e, a, b, c, n)) 48.53202 52.89202 57.76179 55.37080 64.14356 69.74413 10
First, the unit of time is milliseconds. Second, the time to update a hash-of-hashes is less than 50% faster than to update a list-of-lists. If I increase n
ten-fold, I see approximately 10-fold increases in these times -- the list-of-lists and hash-of-hashes are both scaling approximately linearly.
These timing strongly suggest that, performance-wise and at least for this scale of data, we might just as well use the straight-forward list-of-lists approach.
Nonetheless...
Here's a 'Hash' class, which is an environment
Hash <- function()
structure(new.env(parent = emptyenv()), class = "Hash")
h = Hash()
If we try
h[[foo]][[bar]][[var]] <- 1
h
is a Hash, but it contains a single key that is a list-of-lists
> h
<environment: 0x7fddbd00c490>
attr(,"class")
[1] "Hash"
> h[[foo]]
$fruits
$fruits$apple
[1] 1
This is because of the way R performs the evaluation of the assignment -- essentially from right to left, so creating a list(apple = 1)
and then creating list(fruits = list(apple = 1))
before assigning this to our Hash / environment. I can't really see how to use the existing syntax to force the creation of an environment with the right-most assignment, but we could write an update method that coerces the list-of-lists to a Hash-of-Hashes on first assignment
## like as_environment, above...
as_Hash = function(x) {
if (is.list(x)) {
x <- lapply(x, as_Hash)
x <- structure(as.environment(x), class = "Hash")
}
x
}
## re-define assignment of an element to a hash -- if it's a list-of-lists,
## then coerce to a Hash-of-Hashes
`[[<-.Hash` <- function(x, i, value) {
if (is.list(value))
value <- as_Hash(value)
assign(i, value, x)
x
}
The result is then always a Hash-of-Hashes once an assignment is complete.
> h = Hash()
> h[[foo]][[bar]][[var]] <- 1
> h[[foo]][["vegetable"]][["tomato"]] <- 2
> h
<environment: 0x7fddbb7102c8>
attr(,"class")
[1] "Hash"
> h[[foo]]
<environment: 0x7fddbb719ca8>
attr(,"class")
[1] "Hash"
> ls(h[[foo]])
[1] "fruits" "vegetable"
> h[[foo]][[bar]]
<environment: 0x7fddbb718d80>
attr(,"class")
[1] "Hash"
> h[[foo]][[bar]][[var]]
[1] 1
Returning to our original example
l = list()
l[[foo]][[bar]][[var]] = 2
You can get a sense for how R is organizing things using .Internal(inspect(l))
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f886228d660 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f886228d740 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 2
## ... additional output, dealing with the names (ATTRIB) at each level
This says that l
is represented by memory at a particular address @7f886228d6d0
representing R's internal representation of list (a VECSXP
). The VECSXP
is of length 1, and points to another list / VECSXP
at @7f886228d698
. This points to another list / VECSXP
at @7f886228d660
, which contains the value you assigned -- a REALSXP
at @7f886228d740
.
What happens if you update an element?
> l[[foo]][[bar]][[var]] <- 3
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f886228d660 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
...
Notice that only the memory location of the REALSXP
has changed, so you HAVE NOT copied the entire structure, just the part that actually changed. Good.
What about adding another fruit?
> l[[foo]][[bar]][["pear"]] <- 4
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f885dc9ff48 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
@7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
@7f88622a4200 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4
...
We've added a REALSXP
for the pear, but also changed the VECSXP
fruits. We didn't change the REALSXP
for the apple, or the other VECSXP
-- we are again only changing only (or almost only) the memory that requires changing.
And changing elements higher in the food chain ;)?
> l[[foo]][["vegetables"]][["tomato"]] <- 4
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f885dca0148 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
@7f885dc9ff48 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
@7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
@7f88622a4200 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4
...
@7f88622a3f98 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
@7f88622a4008 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4
We changed the VECSXP
corresponding to fruits and vegetables level, and of course added our tomato, but the other components of the data structure remain unchanged.
This is showing that R is making minimal copies of the data, so that we might expect this data structure to be relatively efficient for reasonable sized nested lists. It would be worth discovering whether this was the case before investing more (or this much!) effort!
Upvotes: 3
Reputation: 41240
This works if you pre-declare the lists:
count <- 1
hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)
hash[[foo]] <- list()
hash[[foo]][[bar]] <- list()
hash[[foo]][[bar]][[var]] <- count
Upvotes: 3