Marc in the box

Reputation: 12005

Why is the following function's performance not penalized by growing an additional object?

Below I have three functions that perform the same operation - duplicate a given data.frame and rbind it to itself (i.e. the bad practice of growing an object).

The first function, f1, copies the input object to a new object, x, then grows that object, and finally replaces the input object.

The second function, f2, copies the input object to a new object, x, then grows the input object.

The third function, f3, only grows the input object.

I would have expected f1 to be the slowest, since it seemingly has to allocate memory for both df and x. Instead, all three functions take approximately the same time. How can I understand this behavior? Is my example flawed?

## Functions
# copy df, grow copy, replace df with copy
f1 <- function(df){
  x <- df
  x <- rbind(x, x)
  df <- x
  return(df)
}

# copy df, grow df
f2 <- function(df){
  x <- df
  df <- rbind(df, x)
  return(df)
}

# grow df
f3 <- function(df){
  df <- rbind(df, df)
  return(df)
}


## Benchmark
library(microbenchmark)
df <- rbind(iris)
res <- microbenchmark(f1(df), f2(df), f3(df), times=5000L)


## Print results:
print(res)

# Unit: microseconds
#   expr    min      lq     mean  median      uq       max neval
# f1(df) 255.66 263.591 292.6851 270.123 292.516  2693.291  5000
# f2(df) 255.66 263.591 302.5159 270.590 292.516 15460.876  5000
# f3(df) 255.66 263.591 299.6157 270.122 292.516  3613.758  5000


## Plot results:
boxplot(res)


Upvotes: 1

Views: 22

Answers (1)

F. Privé

Reputation: 11728

R uses copy-on-modify semantics. When you pass df as an argument and don't modify it, the argument refers to the same object that you passed (same address).

The same holds when you assign an existing object to a new name. For example, using the helper address <- function(x) cat(data.table::address(x), "\n"):

> x <- 1
> address(x)
0x58164b8 
> y <- x
> address(y)
0x58164b8 
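
To see the "modify" half of copy-on-modify, here is a minimal sketch that changes y after the assignment. The helper addr() is a hypothetical name introduced for this illustration; it returns the address as a string instead of printing it, so the two addresses can be compared directly.

```r
# addr() is a hypothetical helper: returns the address as a string
addr <- function(x) data.table::address(x)

x <- c(1, 2, 3)
y <- x                                   # second name for the same vector, no copy yet
same_before <- identical(addr(x), addr(y))

y[1] <- 10                               # modifying y forces R to copy the vector first
same_after <- identical(addr(x), addr(y))

print(same_before)  # TRUE
print(same_after)   # FALSE
print(x)            # 1 2 3, x is untouched
```

The copy happens only at the modification, not at the assignment y <- x.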

Now print some addresses within your functions:

## Functions
# copy df, grow copy, replace df with copy
f1 <- function(df){
  address(x <- df)
  address(x <- rbind(x, x))
  address(df <- x)
  return(df)
}

# copy df, grow df
f2 <- function(df){
  address(x <- df)
  address(df <- rbind(df, x))
  return(df)
}

Result:

> df <- rbind(iris)
> address(df)
0x543e378 
> res1 <- f1(df)
0x543e378 
0x5d3bd08 
0x5d3bd08 
> res2 <- f2(df)
0x543e378 
0x5c89e40 

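So each of your functions creates only one new object, the result of rbind. The assignments x <- df and df <- x only bind another name to an existing object without copying it, which is why all three functions have the same footprint.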
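
The same check can be sketched for f3, which is not instrumented in the code shown here. The names f3_traced and addr are hypothetical, introduced only for this illustration; the traced function returns the addresses rather than printing them.

```r
# addr() is a hypothetical helper: returns the address as a string
addr <- function(x) data.table::address(x)

# Variant of f3 that records the address of df before and after rbind
f3_traced <- function(df) {
  before <- addr(df)      # the object passed in by the caller
  df <- rbind(df, df)     # rbind returns a new data.frame; local df is rebound to it
  after <- addr(df)
  list(before = before, after = after, rows = nrow(df))
}

df <- rbind(iris)
out <- f3_traced(df)

print(out$before == addr(df))    # TRUE: the argument is the caller's object, not a copy
print(out$after != out$before)   # TRUE: the rbind result lives at a new address
print(out$rows)                  # 300
```

As with f1 and f2, the only new data.frame bound to a name is the one returned by rbind.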

Upvotes: 1
