bretauv

Reputation: 8567

Why does adding attributes to a dataframe take longer with large dataframes?

Let’s make a simple dataframe and give it an attribute “foo”:

orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE

“foo” is there:

attributes(orig)
#> $names
#> [1] "x1" "x2"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1
#> 
#> $foo
#> [1] TRUE

But if I reorder the columns, “foo” disappears:

new <- orig[, c(2, 1)]
attributes(new)
#> $names
#> [1] "x2" "x1"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1

I could add it back with:

attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
attributes(new)
#> $names
#> [1] "x2" "x1"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1
#> 
#> $foo
#> [1] TRUE

But this operation is time-consuming. Not in this case, because it’s a one-row dataframe, but consider this case with 10,000,000 rows:

orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]

bench::mark(
  test = {
    attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
  }
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 test         43.2ms   46.6ms      21.6    38.1MB     14.4

Of course, this doesn't take much time in absolute terms, but it is much longer than the first case with one row (which takes only a few microseconds). It seems weird to me that the time needed to add a single attribute to a dataframe increases with the size of the dataframe. Am I missing something? Is there a more efficient way to add a list of "simple" attributes to a large dataframe?

Edit: looking for a solution with base R only

Upvotes: 10

Views: 365

Answers (3)

SamR

Reputation: 20444

Row names are stored lazily in a data frame, but when copied they are fully evaluated

When you create a data frame of n rows without explicitly declaring the row names, the row names are stored as an integer vector of length 2 of the form c(NA, -n).

If you copy the row names attribute from one data frame to another, R evaluates this vector in order to copy it. This should never be done.
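
You can see the compact form directly with .row_names_info(), the base helper documented in ?row.names (a quick check; the commented values are what this prints for a 5-row frame):

d <- data.frame(x = 1:5)
.row_names_info(d, type = 0L) # NA -5  (the compact internal form)
rownames(d)                   # "1" "2" "3" "4" "5" (fully evaluated)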

Alternatively you could use data.table or the tidyverse, both of which preserve attributes across these operations, so nothing needs to be copied back.

A closer look at what happens in memory

Let's create a data frame with 10 rows.

num_rows  <- 10
set.seed(0)
dat  <- data.frame(
    x_char = sample(letters, num_rows),
    x_int = sample(1:10, num_rows)
)

Let's look at how it appears in memory. I use a helper function, get_dat_obj_tree(), to print a simplified tree representation of the output of lobstr::sxp(dat).
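
A minimal reconstruction of that helper (my assumptions: lobstr prints raw addresses as 0x... hex strings, and a closure keeps the address-to-integer map stable across calls; the full version also numbers and indents the tree nodes):

make_tree_printer <- function() {
    addr_ids <- character(0) # every address seen so far; its position is its stable integer label
    function(sxp_obj, name = "dat") {
        out <- utils::capture.output(print(sxp_obj))
        # find every raw address; process longest first so no address is
        # mangled by the replacement of a shorter one it contains
        addrs <- unique(unlist(regmatches(out, gregexpr("0x[0-9a-fA-F]+", out))))
        addrs <- addrs[order(-nchar(addrs))]
        for (a in addrs) {
            if (!a %in% addr_ids) addr_ids[[length(addr_ids) + 1]] <<- a
            out <- gsub(a, paste0("0x", match(a, addr_ids)), out, fixed = TRUE)
        }
        cat(name, out, sep = "\n")
    }
}
get_dat_obj_tree <- make_tree_printer()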

library(lobstr)
dat_sxp  <- sxp(dat)
get_dat_obj_tree(dat_sxp)
1 dat    VECSXP    length: 2     mem_addr:0x7
2  ¦--x_char    STRSXP    length: 10     mem_addr:0x1
3  ¦--x_int    INTSXP    length: 10     mem_addr:0x2
4  °--_attrib    LISTSXP    length: 3     mem_addr:0x3
5      ¦--names    STRSXP    length: 2     mem_addr:0x4
6      ¦--class    STRSXP    length: 1     mem_addr:0x5
7      °--row.names    INTSXP    length: 2     mem_addr:0x6

The function replaces the memory addresses with unique integers (e.g. mem_addr:0x1 will remain the address of x_char every time the real address is looked up, unless the memory location of x_char actually changes).

We would expect the data to have length 10. But why are the row.names only length 2? Let's print them:

rownames(dat) # "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
attr(dat, "row.names") #  1  2  3  4  5  6  7  8  9 10

Clearly these are vectors with length 10. You might notice that one is a character vector and one is an integer vector. This led me down a lot of dead-ends, until I found this comment in the R source code:

## As from R 2.4.0, row.names can be either character or integer.
## row.names() will always return character.
## attr(, "row.names") will return either character or integer.
##
## Do not assume that the internal representation is either, since
## 1L:n is stored as the integer vector c(NA, n) to save space (and
## the C-level code to get/set the attribute makes the appropriate
## translations.

This reminded me of something you often see in reproducible examples:

dput(dat)
# structure(list(x_char = c("e", "i", "n", "z", "w", "b", "j",
# "l", "o", "a"), x_int = c(4L, 3L, 6L, 2L, 7L, 10L, 5L, 8L, 9L,
# 1L)), class = "data.frame", row.names = c(NA, -10L))

We see that row names are indeed represented as a vector of length 2, row.names = c(NA, -10L). This is the key to understanding how to avoid the expensive copy operation.

How does creating a new attribute change things?

It doesn't. It simply creates a situation where you are more likely to copy row names yourself, because attributes are not preserved across every operation. R Internals states:

Subsetting (other than by an empty index) generally drops all attributes except names, dim and dimnames which are reset as appropriate.

Let's create a new attribute, foo, and see what happens in memory:

attr(dat, "foo")  <- TRUE

Let's look at the internal representation:

dat_foo_sxp  <- sxp(dat)
get_dat_obj_tree(dat_foo_sxp)
1 dat    VECSXP    length: 2     mem_addr:0x7
2  ¦--x_char    STRSXP    length: 10     mem_addr:0x1
3  ¦--x_int    INTSXP    length: 10     mem_addr:0x2
4  °--_attrib    LISTSXP    length: 4     mem_addr:0x3
5      ¦--names    STRSXP    length: 2     mem_addr:0x4
6      ¦--class    STRSXP    length: 1     mem_addr:0x5
7      ¦--row.names    INTSXP    length: 2     mem_addr:0x6
8      °--foo    LGLSXP    length: 1     mem_addr:0x8

Nothing has truly changed in memory: the attributes pairlist simply has a new node, of type LGLSXP, i.e. a logical vector.

What happens when we subset the data frame?

Let's re-order the columns.

new  <- dat[, c(2,1)]

Although we have selected all the columns, we are essentially subsetting the data by index. Let's look at the nodes of the object in memory:

new_sxp  <- sxp(new)
get_dat_obj_tree(new_sxp, "new")
1 new    VECSXP    length: 2     mem_addr:0x12
2  ¦--x_int    INTSXP    length: 10     mem_addr:0x2
3  ¦--x_char    STRSXP    length: 10     mem_addr:0x1
4  °--_attrib    LISTSXP    length: 3     mem_addr:0x9
5      ¦--names    STRSXP    length: 2     mem_addr:0x10
6      ¦--class    STRSXP    length: 1     mem_addr:0x5
7      °--row.names    INTSXP    length: 2     mem_addr:0x11

This is broadly what we would expect from a lazily-evaluated copy apart from the row.names, which have not changed but have a new memory address:

  1. The data frame itself has a new memory address.
  2. The memory address of the integer column is the same.
  3. The memory address of the character column is the same.
  4. The attributes pairlist has a new memory address.
  5. The names have a new memory address (because they are re-ordered).
  6. The class has the same address.
  7. The row.names have a new memory address.

Perhaps R could have kept the row.names in the same memory location. After all, we are only subsetting columns, so the number and order of rows are unchanged.

However, and this is why my previous suggestion to pre-allocate the row names was wrong, the fact that there are new row.names does not significantly affect execution time. R is creating a new integer vector of length 2, regardless of the size of the data. This takes almost no time. It is probably not worth adding logic to the R source to establish whether the rows are the same, in order to avoid such a tiny operation.
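
We can check that claim directly: .set_row_names() is the base helper (also documented in ?row.names) that builds the compact vector, and its cost does not depend on n (a quick sketch; check = FALSE because the two results differ):

bench::mark(
    small = .set_row_names(10L),
    large = .set_row_names(1e7L),
    check = FALSE
)
# both only create a length-2 integer vector, regardless of n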

So why does the example in the question take longer with larger data frames?

It is notable in your example, and in the answer by Joris C., that operations take longer if they include attr(new, "row.names") <- attr(dat, "row.names"), either on its own or as part of a larger call such as utils::modifyList(attributes(dat), attributes(new)). Let's try the simple way:

attr(new, "row.names") <- attr(dat, "row.names")
get_dat_obj_tree(sxp(new))
1 dat    VECSXP    length: 2     mem_addr:0x15
2  ¦--x_int    INTSXP    length: 10     mem_addr:0x2
3  ¦--x_char    STRSXP    length: 10     mem_addr:0x1
4  °--_attrib    LISTSXP    length: 3     mem_addr:0x13
5      ¦--names    STRSXP    length: 2     mem_addr:0x10
6      ¦--class    STRSXP    length: 1     mem_addr:0x5
7      °--row.names    INTSXP    length: 2     mem_addr:0x14

There's a new memory address. But the row.names attribute of new is still an integer vector of length 2. If we run dput(new) we will see row.names = c(NA, -10L).

So if we are copying an integer vector of length 2 from one place to another, regardless of the size of the data, why does it take longer with larger data frames? The answer lies in what happens when you run:

attr(new, "row.names") <- attr(dat, "row.names")

This is syntactic sugar for:

new  <- `attr<-`(new, "row.names", attr(dat, "row.names"))

Firstly, this means that we are evaluating the row.names of dat. Secondly, as R Internals notes of a similar example, a <- `dim<-`(a, c(7, 2)):

in principle two copies of a exist for the duration of the computation

So this expensive copy may be happening twice.

Where is the evaluation happening?

An easier way to understand this is to print the right-hand side of that function call.

`attr<-`(new, "row.names", attr(dat, "row.names")) 
# <truncated>
# attr(,"row.names")
#  [1]  1  2  3  4  5  6  7  8  9 10

By the time the row.names are stored in new, the R source code in attrib.c is clever enough to restore them to the c(NA, n) form:

INTEGER(val)[0] = NA_INTEGER;
INTEGER(val)[1] = n; // +n:  compacted *and* automatic row names

However, the damage is done: the short-form c(NA, -10) row names were fully evaluated, which, as you would expect (and have demonstrated), takes more time for longer vectors of row names.
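
You can see that cost in isolation: merely reading the attribute forces the expansion (a sketch; timings are machine-dependent):

big <- data.frame(x = numeric(1e7))
bench::mark(expand = attr(big, "row.names"))
# each evaluation allocates and fills a length-1e7 integer vector,
# so time and memory grow with the number of rows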

Solutions

It is possible to avoid this issue in base R, and also with the data.table and tidyverse packages.

base R solution

The main point is: do not copy the row names from one data frame to another. The function suggested by Joris C., which copies only the attributes that were not kept by the subset operation rather than all attributes, is a good base R solution.
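
For illustration, a minimal version of that idea (copy_custom_attrs is my own helper name; it copies a known set of custom attributes by name and never touches row.names, so the compact form is never expanded):

copy_custom_attrs <- function(to, from, attrs = "foo") {
    for (nm in attrs) {
        attr(to, nm) <- attr(from, nm)
    }
    to
}

new <- copy_custom_attrs(new, dat)
attr(new, "foo") # TRUE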

data.table solution

An alternative is to convert the data frame to a data.table and use data.table::setattr() to set attributes by reference:

library(data.table)
orig <- data.frame(x1 = 1, x2 = 2)
setDT(orig)

mem_location  <- tracemem(orig)

setattr(orig, "foo", TRUE)

tracemem(orig) == mem_location # TRUE

attr(orig, "foo") # TRUE

Additionally, with data.table you can change the column order by reference so you do not lose the attributes when you reorder the columns:

setcolorder(orig, c(2,1))
attr(orig, "foo") # TRUE

orig
#    x2 x1
# 1:  2  1

tidyverse solution

Similarly, a tibble() keeps its attributes (including foo) when you subset columns:

library(tibble)

set.seed(0)
num_rows  <- 10
dat  <- tibble(
    x_char = sample(letters, num_rows),
    x_int = sample(1:10, num_rows)
) 

attr(dat, "foo")  <- TRUE

new  <- dat[,c(2,1)]

attr(new, "foo") # TRUE

I went down several dead-ends with this one, and posted two answers that were not quite right before I understood what was really happening under the hood. But I learned a lot about R in the process. Thanks for asking such an interesting question.

Upvotes: 11

user63230

Reputation: 4698

Not a base R solution, but this may be useful nonetheless.

The collapse package is useful for fast transformations like this, and it retains attributes, similar to the data.table approach above. See a related question here.

orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE

library(collapse)
new <- collapse::colorderv(orig, c(2, 1)) 
attributes(new)
# $names
# [1] "x2" "x1"
# 
# $class
# [1] "data.frame"
# 
# $row.names
# [1] 1
# 
# $foo
# [1] TRUE

With larger data this seems much quicker than the OP's approach:

orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
microbenchmark::microbenchmark(
  test = { attributes(new) <- utils::modifyList(attributes(orig), attributes(new)) },
  collapse_approach = {new2 <- collapse::colorderv(orig, c(2, 1))},
  times = 100,
  unit = "ms"  
)
#Unit: milliseconds
#             expr       min         lq        mean     median        uq        max neval cld
#             test 56.264601 57.3118010 63.78885805 58.7479515 59.443000 310.330801   100   b
#collapse_approach  0.007601  0.0126515  0.03835808  0.0535015  0.055301   0.115201   100  a 

Upvotes: 0

Joris C.

Reputation: 6234

The computation time of copying all data.frame attributes seems to scale with the size of the data.frame mainly because of the row.names attribute.

We can check that copying the row.names attribute is responsible for most of the computation time:

orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]

microbenchmark::microbenchmark(
  all_attrs = { attributes(new) <- attributes(orig) },
  rownames = { attr(new, "row.names") <- attr(orig, "row.names") },
  foo = { attr(new, "foo") <- attr(orig, "foo") },
  times = 10,
  unit = "ms"  
)
#> Unit: milliseconds
#>       expr       min       lq       mean     median        uq        max neval
#>  all_attrs 60.477554 61.18414 64.3562408 61.9978505 67.117645  72.827139    10
#>   rownames 59.831147 61.21029 69.6012781 64.2950890 68.880676 106.280348    10
#>        foo  0.001043  0.00206  0.0072771  0.0087225  0.011206   0.015295    10

If we compare this to copying the foo attribute in the case of the small data.frame, the timing is (roughly) of the same order:

orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]

microbenchmark::microbenchmark(
  foo = { attr(new, "foo") <- attr(orig, "foo") },
  unit = "ms"
)
#> Unit: milliseconds
#>  expr     min      lq       mean    median        uq      max neval
#>   foo 0.00115 0.00118 0.00146262 0.0012055 0.0012725 0.022368   100

To be efficient you can copy only the custom-defined attributes (instead of all data.frame attributes). For instance:

## replace only custom attributes
replace_attrs <- function(obj, new_attrs) {
  for(nm in setdiff(names(new_attrs), names(attributes(data.frame())))) {
    attr(obj, which = nm) <- new_attrs[[nm]]
  }
  return(obj)
}

new <- replace_attrs(new, attributes(orig))
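
After this, the custom attribute is back:

attr(new, "foo") # TRUE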

Upvotes: 8
