Reputation: 275
I've run into a strange (to me) behaviour in R's lexical scoping that results from first attaching a NULL environment to the search path, as suggested in the help file for attach(), and then populating it using sys.source().
Here is a simplified and reproducible example of the issue. I have 3 functions (f1, f2, and f3) in three separate files I wish to attach into three separate environments (env.A, env.B, and env.C, respectively). Here is the setup function:
setup <- function() {
  for (i in sprintf('env.%s', LETTERS[1:3])) if (i %in% search())
    detach(i, unload=TRUE, force=TRUE, character.only=TRUE) # detach existing to avoid duplicates
  env.A = attach(NULL, name='env.A')
  env.B = attach(NULL, name='env.B')
  env.C = attach(NULL, name='env.C')
  sys.source('one.R', envir=env.A)
  sys.source('two.R', envir=env.B)
  sys.source('three.R', envir=env.C)
}
setup()
Once this function is called, 3 new environments are created with the functions f1, f2, and f3 contained within each environment. Each function lives in one of 3 separate files: "one.R", "two.R", and "three.R". The functions are trivial:
f1 <- function() {
  print('this is my f1 function')
  return('ok')
}
f2 <- function() {
  f1()
  f3()
  print('this is my f2 function')
  return('ok')
}
f3 <- function() {
  print('this is my f3 function')
  return('ok')
}
As you can see, functions f1 and f3 have no dependencies, but function f2 depends on both f1 and f3. Calling search() shows the following:
[1] ".GlobalEnv" "env.C" "env.B"
[4] "env.A" "package:stats" "package:graphics"
[7] "package:grDevices" "package:utils" "package:datasets"
[10] "package:methods" "Autoloads" "package:base"
Calling f2, gives the following:
> f2()
[1] "this is my f1 function"
Error in f2() : could not find function "f3"
Clearly f2 can "see" f1, but it cannot find f3. Permuting the order of the attached environments leads me to conclude that the order of the search path is critical. Functions lower down in the search path are visible, whereas functions "upstream" of where the function is being called from are not found.
In this case, f2 (env.B) found f1 (env.A), but could not find f3 (env.C). This is contrary to how I understand R's scoping rules (at least I thought I understood them). My understanding is that R first checks the local environment, then the enclosing environment, then any additional enclosing environments, then works its way down the search path, starting with ".GlobalEnv", until it finds the first matching appropriate (function/object) name. If it makes it all the way to "R_EmptyEnv", it returns the "could not find function" error. This obviously isn't happening in this simple example.
What is happening? Why doesn't R traverse the entire search path and find f3 sitting in env.C? I assume there is something going on behind the scenes when the attach call is made. Perhaps some attributes are set detailing dependencies? I have found a workaround that does not run into this issue, whereby I create and populate the environment prior to attaching it. In outline:
env.A <- new.env(); env.B <- new.env(); env.C <- new.env()
sys.source('one.R', envir=env.A)
sys.source('two.R', envir=env.B)
sys.source('three.R', envir=env.C)
attach(env.A)
attach(env.B)
attach(env.C)
This workaround exhibits a behaviour consistent with my expectations, but I am puzzled by the difference: attach then populate vs. populate then attach.
Comments, explanations, thoughts greatly appreciated. Thanks.
Upvotes: 2
Views: 143
Reputation: 275
I agree the answer seems to lie in the default parent assignment differing between attach() and new.env(). I find it a little strange that attach() would assign parentage to the environment second in the search list by default, but it is what it is, there is probably a valid reason behind it. The solution is simple enough:
env.A <- attach(NULL, name='env.A')
parent.env(env.A) <- .GlobalEnv
In the alternate solution using new.env(), there is a small caveat that you didn't run into because you were working directly in the .GlobalEnv, but in the OP, I was working within a temporary environment (the "setup" function). So the parent frame of the new.env() call is actually this setup environment. See below:
setup <- function() {
  env.A <- new.env(); env.B <- new.env(); env.C <- new.env()
  print(parent.env(environment()))
  print(parent.frame())
  print(environment())
  print(parent.env(env.A))
  print(parent.env(env.B))
  print(parent.env(env.C))
}
setup()
#<environment: R_GlobalEnv>
#<environment: R_GlobalEnv>
#<environment: 0x2298368>
#<environment: 0x2298368>
#<environment: 0x2298368>
#<environment: 0x2298368>
When setup() is called from the command line, notice its parent is .GlobalEnv, as is the parent frame. However, the parent of environments A-C is the temporary setup environment (0x2298368). When setup() completes, that environment is not deleted: it stays alive for as long as env.A-C refer to it as their parent. R does not re-assign parentage; a lookup that starts in env.A-C simply passes through the setup environment to its own parent, .GlobalEnv, and this is why this alternative works.
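Whether the parent really changes once the function has returned can be checked directly. This is a minimal sketch, not the original setup: make_env() and marker are hypothetical names used only for the check, and the sketch does nothing more than inspect parent.env() after the call.

```r
# make_env() and marker are hypothetical names used only for this check.
make_env <- function() {
  marker <- "local to make_env()"
  new.env()   # default parent is the calling frame, i.e. this function's environment
}

e <- make_env()

# Is the parent .GlobalEnv now that make_env() has returned?
parent_is_global <- identical(parent.env(e), globalenv())

# Can a lookup that starts in e still reach the function's local variable?
marker_seen <- get("marker", envir = e)

# Where does the chain go after the function's environment?
next_up <- parent.env(parent.env(e))

print(parent_is_global)
print(marker_seen)
print(identical(next_up, environment(make_env)))
```

In this sketch the parent of e is still the function's own environment after the call, and the next step up is wherever make_env() was defined (which is .GlobalEnv when the code is run at the prompt).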
I think a cleaner way would be not to depend on this and to specify the parent directly: env.A <- new.env(parent=.GlobalEnv). This works fine in my test case ... we'll see what happens when I scale up to ~750 interdependent functions!
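For anyone who wants to try the populate-then-attach variant with an explicit parent, here is a rough sketch. It is an assumption-laden stand-in for the real setup: the three functions are defined inline with evalq() instead of being read from one.R, two.R and three.R with sys.source(), and they return short strings rather than printing.

```r
# Sketch only: inline definitions replace the sys.source() calls so that
# the example runs without one.R, two.R and three.R.
setup <- function() {
  nms <- c("env.A", "env.B", "env.C")
  for (nm in nms) if (nm %in% search()) detach(nm, character.only = TRUE)

  # Every environment gets .GlobalEnv as its parent, regardless of where
  # new.env() is called from.
  envs <- lapply(nms, function(nm) new.env(parent = globalenv()))
  names(envs) <- nms

  evalq(f1 <- function() "f1", envs$env.A)
  evalq(f2 <- function() c(f1(), f3()), envs$env.B)
  evalq(f3 <- function() "f3", envs$env.C)

  for (nm in nms) attach(envs[[nm]], name = nm, warn.conflicts = FALSE)
  invisible(envs)
}

setup()
res <- f2()   # f2 reaches f1 and f3 through .GlobalEnv and the search path

# Leave the search path as it was.
for (nm in c("env.A", "env.B", "env.C")) detach(nm, character.only = TRUE)

print(res)
# [1] "f1" "f3"
```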
Thanks again for your clear answer, I'd up-vote it but I'm apparently too new to have that privilege.
Upvotes: 0
Reputation: 206606
The difference between the two methods has to do with the parent environment of each of the newly created environments.
When R finds a function, it resolves the free variables in that function's body starting from the environment where the function was defined. If it cannot find them there, it looks next in the parent environment, and it continues to do so until it gets all the way to the empty environment. So if a function's environment has the global environment as its parent, then every environment in the search path will be searched, as you were expecting.
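To make that chain visible, something like the following can help. It is only a sketch: lookup_chain() and g are hypothetical names, not part of base R, and the helper simply follows parent.env() from a function's environment until it reaches the empty environment.

```r
# lookup_chain() is a hypothetical helper: it lists, in order, the names of
# the environments R would walk through when resolving a free variable in f.
# Environments without a name show up as "".
lookup_chain <- function(f) {
  env <- environment(f)
  chain <- character(0)
  while (!identical(env, emptyenv())) {
    chain <- c(chain, environmentName(env))
    env <- parent.env(env)
  }
  c(chain, environmentName(emptyenv()))
}

g <- function() NULL
environment(g) <- globalenv()   # make the starting point explicit

chain <- lookup_chain(g)
print(chain)
```

For a function whose environment is .GlobalEnv, the printed chain is the search path followed by the empty environment; calling lookup_chain(f2) after the attach(NULL) setup should show where the chain for f2 actually starts.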
When you create an environment with
env.A <- new.env()
the default value for the parent= parameter is parent.frame(), so when you call it at the top level the parent is set to the current environment() value. Observe
parent.env(env.A)
# <environment: R_GlobalEnv>
so env.A is a child of the global environment. However, when you do
env.A = attach(NULL, name='env.A')
parent.env(env.A)
# <environment: 0x1089c0ea0>
# attr(,"name")
# [1] "tools:RGUI"
You will see that it sets the parent to the environment in the search path that was last loaded (which happens to be "tools:RGUI" for me after a fresh R restart.) And continuing
env.B = attach(NULL, name='env.B')
parent.env(env.B)
#<environment: 0x108a2edf8>
#attr(,"name")
#[1] "env.A"
env.C = attach(NULL, name='env.C')
parent.env(env.C)
# <environment: 0x108a4f6e0>
# attr(,"name")
# [1] "env.B"
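The same thing can be checked without reading hex addresses. The sketch below assumes a throwaway name, env.X, that is not part of the original example; it records what sits in position 2 before the attach and compares afterwards.

```r
# env.X is a hypothetical name used only for this check.
old_second <- as.environment(2)          # whatever follows .GlobalEnv right now
env.X <- attach(NULL, name = "env.X")    # new, empty environment at position 2

# The new environment's parent is the old second entry ...
parent_is_old_second <- identical(parent.env(env.X), old_second)
# ... and .GlobalEnv now points at the new environment.
global_points_to_new <- identical(parent.env(globalenv()), env.X)

detach("env.X", character.only = TRUE)   # leave the search path as it was

print(parent_is_old_second)
print(global_points_to_new)
```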
Notice how as we continue to add environments via attach(), they do not have a parent of GlobalEnv. This means that once we resolve a variable to env.B, it does not have a way to go "up the chain" to env.C. This is why it cannot find f3(). This is the same as doing
env.A <- new.env(parent=parent.env(globalenv()));
env.B <- new.env(parent=env.A);
env.C <- new.env(parent=env.B);
with explicit calls to new.env().
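The failure can be reproduced with that explicit chain alone, without touching the search path. This is a sketch under a few assumptions: the functions are defined inline rather than sourced from files, they return strings instead of printing, and it is run in a session where nothing named f3 is already on the search path.

```r
# Same parent chain as attach(NULL) produces: each environment sits
# "below" the previous one, and none of them has .GlobalEnv as an ancestor.
env.A <- new.env(parent = parent.env(globalenv()))
env.B <- new.env(parent = env.A)
env.C <- new.env(parent = env.B)

# Inline stand-ins for the sourced files.
evalq(f1 <- function() "f1", env.A)
evalq(f2 <- function() c(f1(), f3()), env.B)
evalq(f3 <- function() "f3", env.C)

# f2 starts looking in env.B, then env.A, then the rest of the search path.
# env.C is never visited, so f1 is found and f3 is not.
res <- tryCatch(env.B$f2(), error = function(e) conditionMessage(e))
print(res)
```

Here res ends up holding the error message for the missing f3 rather than the two return values.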
Note that if I switch the order of attaches to
env.C = attach(NULL, name='env.C')
env.B = attach(NULL, name='env.B')
env.A = attach(NULL, name='env.A')
and try to run f2(), this time it can't find f1(), again because it can only go one way up the chain.
So the two different ways to create environments differ in the way they assign the default parent environment. So perhaps the attach(NULL) method really isn't appropriate for you in this case.
Upvotes: 1