I'm trying to get a better understanding of the performance of for loops in R. I modified an example from Hadley's book, but I'm still confused.
I have the following set-up, where the for loop goes over several random columns:
set.seed(123)
df <- as.data.frame(matrix(runif(1e3), ncol = 10))
cols <- sample(names(df), 2)
tracemem(df)
I have a for loop that runs for every element of cols:
for (i in seq_along(cols)) {
df[[cols[i]]] <- 3.2
}
I get the following list of copies.
tracemem[0x1c54040 -> 0x20e1470]:
tracemem[0x20e1470 -> 0x20e17b8]: [[<-.data.frame [[<-
tracemem[0x20e17b8 -> 0x20dc4b8]: [[<-.data.frame [[<-
tracemem[0x20dc4b8 -> 0x20dc800]:
tracemem[0x20dc800 -> 0x20dc8a8]: [[<-.data.frame [[<-
tracemem[0x20dc8a8 -> 0x20dcaa0]: [[<-.data.frame [[<-
Hadley notes in his example:
In fact, each iteration copies the data frame not once, not twice, but three times! Two copies are made by [[.data.frame, and a further copy is made because [[.data.frame is a regular function that increments the reference count of x.
Can someone explain why the [[<-.data.frame method needs to make two copies?
Reputation: 44788
This isn't really a complete answer to your question, but it's a start.
If you look in the R Language Definition, you'll see that df[["name"]] <- 3.2 is implemented as
`*tmp*` <- df
df <- "[[<-.data.frame"(`*tmp*`, "name", value=3.2)
rm(`*tmp*`)
So one copy gets put into `*tmp*`. If you call debug("[[<-.data.frame"), you'll see that it really does get called with an argument called `*tmp*`, and tracemem() will show that the first duplication happens before you enter the function.
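The expansion above can be tried by hand. This is only a sketch: the data frame and the column name "a" are made up for illustration, and it calls the `[[<-` generic (which dispatches to the data frame method) rather than the method directly.

```r
# Hypothetical example data; the column name "a" is an assumption.
df1 <- data.frame(a = 1:3, b = 4:6)
df2 <- df1

# Ordinary replacement syntax.
df1[["a"]] <- 3.2

# Hand-written version of the expansion from the R Language Definition.
`*tmp*` <- df2
df2 <- `[[<-`(`*tmp*`, "a", value = 3.2)
rm(`*tmp*`)

# Both routes should produce the same data frame.
stopifnot(identical(df1, df2))
```

Running the hand-written version under tracemem() is one way to see which step reports each duplication, though the exact output can differ between R versions.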
The function [[<-.data.frame is a regular function with a header like this:
function (x, i, j, value)
That function gets called as
`[[<-.data.frame`(`*tmp*`, "name", value = 3.2)
Now there are three references to the data frame: df in the global environment, `*tmp*` in the internal code, and x in that function. (Actually, there's an intermediate step where the generic is called, but it is a primitive, so it doesn't need to make a new reference.)
The class of x gets changed in the function; that triggers a copy. Then one of the components of x is changed; that's another copy. So that makes 3.
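To make those two steps concrete, here is a stripped-down stand-in for the method. It is not the real source of [[<-.data.frame: the name set_col, the recycling via rep(), and the example data are all assumptions for illustration. It only mimics the "change the class, then change a component" sequence described above.

```r
# Hypothetical, simplified stand-in for `[[<-.data.frame` (NOT the real source).
set_col <- function(x, i, value) {
  cl <- oldClass(x)    # remember the original class
  class(x) <- NULL     # step 1: class change on an object that is also referenced elsewhere
  n <- length(x[[1L]]) # number of rows, taken from the first column
  x[[i]] <- rep(value, length.out = n)  # step 2: change one component
  class(x) <- cl       # restore the class before returning
  x
}

df <- data.frame(a = 1:3, b = 4:6)

# tracemem() needs an R build with memory profiling, so guard the call.
if (capabilities("profmem")) tracemem(df)
out <- set_col(df, "a", 3.2)
if (capabilities("profmem")) untracemem(df)

# The caller's df is untouched; only the returned value carries the change.
stopifnot(identical(df$a, 1:3), all(out$a == 3.2))
```

How many tracemem lines you see for this sketch depends on the R version, since the reference-tracking internals have changed over time, so treat the count as something to observe rather than a fixed number.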
Just guessing, I'd say the reason for the first duplication is that a complicated replacement might refer to the original value, and it's avoiding the possibility of retrieving a partially modified value.