OatMeal

Reputation: 45

Nested hashes in R

How does one efficiently nest deep hashes with variables in R? For example:

hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)
foo <- 'food'
bar <- 'fruits'
var <- 'apple'
count <- 1

# this will work... but it's only one level in
hash[[foo]] <- count

# A deeper nest is needed, but does not work:
hash[[foo]][[bar]][[var]] <- count

# this gets closer, but foo and bar need to be evaluated as their variables
hash$foo$bar$var <- count

# examine the keys
ls(hash)

I saw an answer in the post here suggesting that this might not be possible in R. Is that true? With $ assignment we can go a bit further, but $ treats foo, bar, and var as literal names rather than evaluating them as variables, which is what is needed here.

I have read that using an environment's hashing is faster than using some packages, and it would be best to learn the fastest approach possible, but if a package is needed to do the job, then I guess it must be used.

Upvotes: 1

Views: 555

Answers (2)

Martin Morgan

Reputation: 46876

Here's a list-of-lists that has the properties that you'd like, in terms of adding arbitrary levels with no need to 'pre-declare'.

l = list()
l[[foo]][[bar]][[var]] = 2
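A quick check (re-using the foo, bar, var variables from the question) shows that the intermediate lists are created automatically, with no pre-declaration:

```r
foo <- "food"; bar <- "fruits"; var <- "apple"

l <- list()
l[[foo]][[bar]][[var]] <- 2

l[[foo]][[bar]][[var]]  # 2
names(l[[foo]])         # "fruits" -- the middle level was created on the fly
```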

list-of-lists to hash-of-hashes

After the fact, one could transform this from a list-of-lists to nested environments with something like the following (this would allow the structure to be passed to a function and leaf nodes updated, for instance, without having to return the structure):

as_environment = function(x) {
    if (is.list(x)) {
        x <- lapply(x, as_environment)
        x <- as.environment(x)
    }
    x
}

e = as_environment(l)

This shows that it is possible to have nested hashes in R.
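A minimal sketch of the pass-by-reference benefit mentioned above: because environments are not copied on assignment, a function can update a leaf without returning the structure (the helper name increment_apple is purely illustrative):

```r
as_environment <- function(x) {
    if (is.list(x)) {
        x <- lapply(x, as_environment)
        x <- as.environment(x)
    }
    x
}

e <- as_environment(list(food = list(fruits = list(apple = 1))))

## illustrative helper: modify a leaf in place; nothing is returned
increment_apple <- function(env)
    env[["food"]][["fruits"]][["apple"]] <-
        env[["food"]][["fruits"]][["apple"]] + 1

increment_apple(e)
e[["food"]][["fruits"]][["apple"]]  # 2 -- the caller sees the update
```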

Here's some data -- nested levels with 50, 100, or 1000 possible values, and 10000 total data points

m = c(50, 100, 1000)
n = 1000 * 10
d = list(
    a = sample(as.character(seq_len(m[[1]])), n, TRUE),
    b = sample(as.character(seq_len(m[[2]])), n, TRUE),
    c = sample(as.character(seq_len(m[[3]])), n, TRUE)
)

Here are some functions to measure performance

f0 = function(a, b, c, n) {
    ## data access fixed cost
    for (i in seq_len(n))
        c(a[[i]], b[[i]], c[[i]])
}

f1 = function(x, a, b, c, n) {
    ## creation / assignment
    for (i in seq_len(n))
        x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] <- 1
    x
}

f2 = function(x, a, b, c, n) {
    ## update
    for (i in seq_len(n))
        x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] <-
            x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] + 1
    x
}

Here's some benchmark data

library(microbenchmark)

l <- with(d, f1(list(), a, b, c, n))
e <- as_environment(l)
microbenchmark(
    with(d, f0(a, b, c, n)),
    with(d, f1(list(), a, b, c, n)),
    with(d, f2(l, a, b, c, n)),
    with(d, f2(e, a, b, c, n)),
    times = 10
)

with output...

Unit: milliseconds
                            expr      min       lq      mean    median        uq       max neval
         with(d, f0(a, b, c, n)) 16.59220 17.37859  19.36920  18.02578  20.46342  25.21631    10
 with(d, f1(list(), a, b, c, n)) 72.54094 74.24085  83.54071  81.90286  90.75257  98.03838    10
      with(d, f2(l, a, b, c, n)) 86.65550 96.49548 104.69007 101.74540 116.04673 135.76844    10
      with(d, f2(e, a, b, c, n)) 48.53202 52.89202  57.76179  55.37080  64.14356  69.74413    10

First, note that the unit of time is milliseconds. Second, updating the hash-of-hashes takes a bit more than half the time of updating the list-of-lists -- faster, but not dramatically so. If I increase n ten-fold, I see approximately 10-fold increases in these times -- the list-of-lists and hash-of-hashes both scale approximately linearly.

These timings strongly suggest that, performance-wise and at least for this scale of data, we might just as well use the straightforward list-of-lists approach.

Nonetheless...

A Hash class supporting nested construction?

Here's a 'Hash' class, which is an environment

Hash <- function()
    structure(new.env(parent = emptyenv()), class = "Hash")
h = Hash()

If we try

h[[foo]][[bar]][[var]] <- 1

h is a Hash, but it contains a single key whose value is a list-of-lists

> h
<environment: 0x7fddbd00c490>
attr(,"class")
[1] "Hash"
> h[[foo]]
$fruits
$fruits$apple
[1] 1

This is because of the way R evaluates the assignment -- essentially from right to left: it first creates list(apple = 1), then list(fruits = list(apple = 1)), and only then assigns that list to our Hash / environment. I can't really see how to use the existing syntax to force the creation of an environment with the right-most assignment, but we could write an update method that coerces the list-of-lists to a Hash-of-Hashes on first assignment
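Conceptually, the nested assignment proceeds in steps like these (the intermediate names inner and middle are illustrative, not what R actually creates internally):

```r
foo <- "food"; bar <- "fruits"; var <- "apple"
h <- new.env(parent = emptyenv())

## h[[foo]][[bar]][[var]] <- 1 is built right to left:
inner <- list()
inner[[var]] <- 1        # list(apple = 1)
middle <- list()
middle[[bar]] <- inner   # list(fruits = list(apple = 1))
h[[foo]] <- middle       # a single assignment into the environment

h[[foo]][[bar]][[var]]   # 1 -- but h[[foo]] is a plain list, not an environment
```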

## like as_environment, above...
as_Hash = function(x) {
    if (is.list(x)) {
        x <- lapply(x, as_Hash)
        x <- structure(as.environment(x), class = "Hash")
    }
    x
}

## re-define assignment of an element to a hash -- if it's a list-of-lists, 
## then coerce to a Hash-of-Hashes
`[[<-.Hash` <- function(x, i, value) {
    if (is.list(value))
        value <- as_Hash(value)
    assign(i, value, x)
    x
}

The result is then always a Hash-of-Hashes once an assignment is complete.

> h = Hash()
> h[[foo]][[bar]][[var]] <- 1
> h[[foo]][["vegetable"]][["tomato"]] <- 2
> h
<environment: 0x7fddbb7102c8>
attr(,"class")
[1] "Hash"
> h[[foo]]
<environment: 0x7fddbb719ca8>
attr(,"class")
[1] "Hash"
> ls(h[[foo]])
[1] "fruits"    "vegetable"
> h[[foo]][[bar]]
<environment: 0x7fddbb718d80>
attr(,"class")
[1] "Hash"
> h[[foo]][[bar]][[var]]
[1] 1
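For inspecting a Hash-of-Hashes it can be convenient to convert back to a list-of-lists; a small converse of as_Hash (the name as_list_deep is made up here) might look like:

```r
## converse of as_Hash: recursively turn nested environments back into lists
as_list_deep <- function(x) {
    if (is.environment(x))
        lapply(as.list(x), as_list_deep)
    else
        x
}

## build a small nested-environment structure and round-trip it
h <- as.environment(list(
    food = as.environment(list(
        fruits = as.environment(list(apple = 1))
    ))
))
as_list_deep(h)$food$fruits$apple  # 1
```

as.list() on an environment returns its bindings as a named list, so the recursion simply mirrors as_environment in the other direction.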

But is it necessary?

Returning to our original example

l = list()
l[[foo]][[bar]][[var]] = 2

You can get a sense for how R is organizing things using .Internal(inspect(l))

> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
  @7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
    @7f886228d660 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
      @7f886228d740 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 2
## ... additional output, dealing with the names (ATTRIB) at each level

This says that l is represented by memory at a particular address, @7f886228d6d0, holding R's internal representation of a list (a VECSXP). The VECSXP is of length 1, and points to another list / VECSXP at @7f886228d698. This in turn points to another list / VECSXP at @7f886228d660, which contains the value you assigned -- a REALSXP at @7f886228d740.

What happens if you update an element?

> l[[foo]][[bar]][[var]] <- 3
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
  @7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
    @7f886228d660 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
      @7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
...

Notice that only the memory location of the REALSXP has changed, so you HAVE NOT copied the entire structure, just the part that actually changed. Good.

What about adding another fruit?

> l[[foo]][[bar]][["pear"]] <- 4
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
  @7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
    @7f885dc9ff48 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
      @7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
      @7f88622a4200 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4
...

We've added a REALSXP for the pear, and also changed the VECSXP for fruits. We didn't change the REALSXP for the apple, or the other VECSXPs -- we are again changing only (or almost only) the memory that requires changing.

And changing elements higher in the food chain ;)?

> l[[foo]][["vegetables"]][["tomato"]] <- 4
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
  @7f885dca0148 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
    @7f885dc9ff48 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
      @7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
      @7f88622a4200 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4
...
    @7f88622a3f98 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
      @7f88622a4008 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4

We changed the VECSXPs at the fruits and vegetables levels, and of course added our tomato, but the other components of the data structure remain unchanged.

This shows that R makes minimal copies of the data, so we might expect this data structure to be relatively efficient for reasonably sized nested lists. It would be worth discovering whether this is the case before investing more (or this much!) effort!

Upvotes: 3

Waldi

Reputation: 41240

This works if you pre-declare the lists:

count <- 1
hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)
hash[[foo]] <- list()
hash[[foo]][[bar]] <- list()
hash[[foo]][[bar]][[var]] <- count
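The pre-declaration can also be wrapped in a small helper so callers don't have to do it by hand (the name hash_assign is hypothetical; the first key lives in the environment, deeper keys in nested lists):

```r
foo <- "food"; bar <- "fruits"; var <- "apple"; count <- 1

## hypothetical helper: assign along a key path, creating levels as needed
hash_assign <- function(hash, keys, value) {
    first <- keys[[1]]
    if (length(keys) == 1L) {
        hash[[first]] <- value
    } else {
        node <- if (exists(first, envir = hash, inherits = FALSE))
            hash[[first]]
        else
            list()
        node[[keys[-1]]] <- value  # recursive [[<- creates missing levels
        hash[[first]] <- node
    }
    invisible(hash)
}

hash <- new.env(hash = TRUE, parent = emptyenv())
hash_assign(hash, c(foo, bar, var), count)
hash[[foo]][[bar]][[var]]  # 1
```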

Upvotes: 3
