Why does data.table j have a different environment when directly calling mget() vs calling mget() inside another function?

Question

I would like to wrap mget() in a simple function so that it returns an unnamed list, and use this function within data.table j.

I've printed out the environment within the function body passed to data.table j. I found data.table j uses one environment when calling my function, and another when using unname(mget()). I have tried playing around with inherits, but used inherits=F here to be more strict about where we find the relevant variables.

This approach works:

library(data.table); library(purrr) 
# a list of functions the user can access
functionDictionary <- list(
  sum = sum,
  weighted_sum = function(x,y) sum(x)/sum(y)
)

grouping_vars <- c('cyl', 'vs')

# user defines here which calculations they wish to make with which 
# columns
userList <- list(
  reactive = list(names = c('my_var1', 'my_var2'),
                  calculations = list(
                    sum = c('hp'),
                    weighted_sum=c('hp', 'mpg')
                  ))
)

mtcars <- data.table(mtcars)
mtcars[,
         {
           env <- environment() # get env in datatable j
           print('grouping')
           print(names(env))
           functionList <- 
             map2(names(userList[['reactive']]$calculations), 
                  userList[['reactive']]$calculations,        
                    ~ do.call(functionDictionary[[.x]],   
                              unname(mget(.y, envir=env, 
                                          inherits=F)))
             )
           functionList # last expression in `{` is returned
         }
         , 
         by=grouping_vars
         ]

However, adding a simple wrapper to mget() fails to find 'hp', and indeed, it is not listed in the environment of the function body passed to data.table j.


mget_unnamed <- function(x,...) unname(mget(x, inherits=F, ...))

mtcars[,
         {
           env <- environment() # get env in datatable j
           print('grouping')
           print(names(env))
           functionList <- 
             map2(names(userList[['reactive']]$calculations), 
                  userList[['reactive']]$calculations,        
                    ~ do.call(functionDictionary[[.x]],   
                             mget_unnamed(.y, envir=env))
             )
           functionList # last expression in `{` is returned
         }
         , 
         by=grouping_vars
         ]

The error is: "Error: value for ‘hp’ not found."

Frank · Accepted Answer

Here's one way:

ff = function(d, g, uL, dict = functionDictionary){
  r    = uL$reactive
  nms  = r$names
  fns  = names(r$calculations)
  cols = r$calculations

  exprs = lapply(setNames(seq_along(nms), nms), function(ii){
    fx = substitute(dict[[f]], list(f=fns[[ii]]))
    cx = lapply(cols[[ii]], as.name)
    as.call(c(fx, cx))
  })
  cat("The expressions:
"); print(exprs)

  call = as.call(c(as.name("list"), exprs))
  cat("The call:
"); print(call)

  d[, eval(call), by=g]
}

Usage:

ff(mtcars, grouping_vars, userList)

The expressions:
$my_var1
dict[["sum"]](hp)

$my_var2
dict[["weighted_sum"]](hp, mpg)

The call:
list(my_var1 = dict[["sum"]](hp), my_var2 = dict[["weighted_sum"]](hp, 
    mpg))
   cyl vs my_var1   my_var2
1:   6  0     395  6.401945
2:   4  1     818  3.060232
3:   6  1     461  6.026144
4:   8  0    2929 13.855251
5:   4  0      91  3.500000

Comment. The map2 function from purrr has NSE of its own (with ~, .x and .y as seen in the OP) in addition to data.table's NSE so things might get messy even if you find a workaround for a particular case (like OP mentions eval(as.symbol(z)) works here).

I find the base R tools (like quote and substitute) generalize to more of my use cases; and eval is the standard approach to meta-programming with data.table and will allow use of its various optimizations. If those optimizations are important for your use case, you might want to look into changing the functionDictionary interface, since with verbose=TRUE we can see that only the second call below gets "GForce" optimization:

mtcars[, functionDictionary[["sum"]](hp), by=cyl, verbose=TRUE]
# ...
# lapply optimization is on, j unchanged as 'functionDictionary[["sum"]](hp)'
# GForce is on, left j unchanged
# ...
mtcars[, sum(hp), by=cyl, verbose=TRUE]
# ...
# lapply optimization is on, j unchanged as 'sum(hp)'
# GForce optimized j to 'gsum(hp)'
# ...

Why does data.table j have a different environment when directly calling mget() vs calling mget() inside another function?

Answers (1)

Related Questions