user21359

Reputation: 476

Why does the extract method for data frames make two copies?

I'm trying to get a better understanding of the performance of for loops in R. I modified the example from Hadley's book here, but I'm still confused.

I have the following set-up, where the for loop goes over several random columns:

set.seed(123)
df <- as.data.frame(matrix(runif(1e3), ncol = 10))
cols <- sample(names(df), 2)
tracemem(df)

I have a for loop that runs for every element of cols.

for (i in seq_along(cols)) {
  df[[cols[i]]] <- 3.2
}

I get the following list of copies.

tracemem[0x1c54040 -> 0x20e1470]: 
tracemem[0x20e1470 -> 0x20e17b8]: [[<-.data.frame [[<- 
tracemem[0x20e17b8 -> 0x20dc4b8]: [[<-.data.frame [[<- 
tracemem[0x20dc4b8 -> 0x20dc800]: 
tracemem[0x20dc800 -> 0x20dc8a8]: [[<-.data.frame [[<- 
tracemem[0x20dc8a8 -> 0x20dcaa0]: [[<-.data.frame [[<- 

Hadley notes in his example:

In fact, each iteration copies the data frame not once, not twice, but three times! Two copies are made by [[.data.frame, and a further copy is made because [[.data.frame is a regular function that increments the reference count of x.

Can someone explain why the [[<-.data.frame method needs to make two copies?

Upvotes: 1

Views: 83

Answers (1)

user2554330

Reputation: 44788

This isn't really a complete answer to your question, but it's a start.

If you look in the R Language Definition, you'll see that df[["name"]] <- 3.2 is implemented as

`*tmp*` <- df
df <- "[[<-.data.frame"(`*tmp*`, "name", value=3.2)
rm(`*tmp*`)

So one copy gets put into `*tmp*`. If you call debug("[[<-.data.frame"), you'll see that it really does get called with an argument named `*tmp*`, and tracemem() will show that the first duplication happens before the function is entered.
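To make the `*tmp*` expansion concrete, here is a minimal sketch using a made-up replacement function. The name `second<-` is hypothetical and only for illustration; it is not part of base R. The sketch compares the sugared assignment with the manual expansion described in the R Language Definition.

```r
# Hypothetical replacement function: sets the second element of a vector.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

# Sugared form: R rewrites this into the *tmp* sequence behind the scenes.
v <- c(1, 2, 3)
second(v) <- 10

# Manual expansion of the same assignment.
w <- c(1, 2, 3)
`*tmp*` <- w
w <- `second<-`(`*tmp*`, value = 10)
rm(`*tmp*`)

# Both routes give the same result.
identical(v, w)
```

The same rewriting applies to df[["name"]] <- 3.2, except that the function called is `[[<-`, which then dispatches to the data frame method.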

The function [[<-.data.frame is a regular function with a header like this:

function (x, i, j, value)  

That function gets called as

`[[<-.data.frame`(`*tmp*`, "name", value = 3.2)

Now there are three references to the data frame: df in the global environment, `*tmp*` in the internal code, and x in that function. (Actually, there's an intermediate step where the generic `[[<-` is called, but it is a primitive, so it doesn't need to create a new reference.)

The class of x gets changed in the function, which triggers a copy. Then one of the components of x is changed, which triggers another. Together with the duplication that happens before the call, that makes three.
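If you want to watch this pattern in isolation, the following is a rough sketch of a regular function that changes the class of its argument and then modifies one component. The function strip_and_set is a hypothetical stand-in written for this illustration; it is not the actual source of `[[<-.data.frame`. How many copies tracemem() reports may differ between R versions, so no output is shown.

```r
# Hypothetical stand-in: drop the class, modify one component, restore the class.
strip_and_set <- function(x, name, value) {
  cl <- oldClass(x)
  class(x) <- NULL    # x is still shared with the caller's object here
  x[[name]] <- value  # modifies one component of the underlying list
  class(x) <- cl
  x
}

df <- data.frame(a = 1:3, b = 4:6)

# tracemem() is only available if R was built with memory profiling support.
if (capabilities("profmem")) tracemem(df)

df2 <- strip_and_set(df, "a", rep(0L, 3))

if (capabilities("profmem")) untracemem(df)

# The caller's data frame is unchanged; only the returned copy was modified.
df$a
df2$a
```

Because x inside the function refers to the same object as df, R cannot modify it in place, which is why the original df still holds its old values afterwards.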

Just guessing, I'd say the reason for the first duplication is that a complicated replacement might refer to the original value, and it's avoiding the possibility of retrieving a partially modified value.

Upvotes: 1
