Reputation: 8567
Let’s make a simple dataframe and give it an attribute “foo”:
orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE
“foo” is there:
attributes(orig)
#> $names
#> [1] "x1" "x2"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1
#>
#> $foo
#> [1] TRUE
But if I reorder the columns, “foo” disappears:
new <- orig[, c(2, 1)]
attributes(new)
#> $names
#> [1] "x2" "x1"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1
I could add it back with:
attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
attributes(new)
#> $names
#> [1] "x2" "x1"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1
#>
#> $foo
#> [1] TRUE
But this operation is time-consuming. Not in this case, because it’s a one-row dataframe, but consider this case with 10,000,000 rows:
orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
bench::mark(
  test = {
    attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
  }
)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 test 43.2ms 46.6ms 21.6 38.1MB 14.4
Of course, this doesn't take very long, but it is much slower than in the first case with one row (which takes only a few microseconds). It seems weird to me that the time needed to add a single attribute to a dataframe increases with the size of the dataframe. Am I missing something? Is there a more efficient way to add a list of "simple" attributes to a large dataframe?
Edit: looking for a solution with base R only
Upvotes: 10
Views: 365
Reputation: 20444
When you create a data frame of n rows without explicitly declaring the row names, the row names are stored as an integer vector of length 2 of the form c(NA, -n).
If you copy the row names attribute from one data frame to another, R expands this compact vector into all n row names in order to copy it, so the cost grows with the number of rows. This should never be done.
Alternatively you could use data.table or tidyverse, both of which keep attributes when a copy is made, avoiding the need to copy anything.
Let's create a data frame with 10 rows.
num_rows <- 10
set.seed(0)
dat <- data.frame(
  x_char = sample(letters, num_rows),
  x_int = sample(1:10, num_rows)
)
Let's look at how it appears in memory. I use a helper function to create a simplified tree representation of the output of lobstr::sxp(dat) to show how objects are represented in memory.
library(lobstr)
dat_sxp <- sxp(dat)
get_dat_obj_tree(dat_sxp)
1 dat VECSXP length: 2 mem_addr:0x7
2 ¦--x_char STRSXP length: 10 mem_addr:0x1
3 ¦--x_int INTSXP length: 10 mem_addr:0x2
4 °--_attrib LISTSXP length: 3 mem_addr:0x3
5 ¦--names STRSXP length: 2 mem_addr:0x4
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x6
The function replaces the memory addresses with unique integers (i.e. mem_addr:0x1 will remain the address of x_char every time the real address is looked up, unless the memory location of x_char actually changes).
We would expect the data to have length 10. But why are the row.names only length 2? Let's print them:
rownames(dat) # "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
attr(dat, "row.names") # 1 2 3 4 5 6 7 8 9 10
Clearly these are vectors with length 10. You might notice that one is a character vector and one is an integer vector. This led me down a lot of dead-ends, until I found this comment in the R source code:
## As from R 2.4.0, row.names can be either character or integer.
## row.names() will always return character.
## attr(, "row.names") will return either character or integer.
##
## Do not assume that the internal representation is either, since
## 1L:n is stored as the integer vector c(NA, n) to save space (and
## the C-level code to get/set the attribute makes the appropriate
## translations.
This reminded me of something you often see in reproducible examples:
dput(dat)
# structure(list(x_char = c("e", "i", "n", "z", "w", "b", "j",
# "l", "o", "a"), x_int = c(4L, 3L, 6L, 2L, 7L, 10L, 5L, 8L, 9L,
# 1L)), class = "data.frame", row.names = c(NA, -10L))
We see that row names are indeed represented as a vector of length 2, row.names = c(NA, -10L). This is the key to understanding how to avoid the expensive copy operation.
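As a small sketch of the point above, the compact form can also be inspected directly in base R. This assumes `.row_names_info()` with `type = 0L`, which is documented as returning the internal row names; the data frame `df` is just an illustration and not part of the question.

```r
# Automatic row names: created without specifying row.names
df <- data.frame(x = 1:5)

# type = 0L asks for the row names as they are stored internally
internal <- .row_names_info(df, type = 0L)
print(internal)

# attr() goes through the C-level accessor, which hands back all n row names
expanded <- attr(df, "row.names")
print(length(expanded))
```

The first result has two elements however many rows the data frame has, while the second always has one element per row.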
Adding an attribute does not, in itself, cause the expensive copy. It simply creates a circumstance where you are more likely to copy row names, because attributes are not carried over by every operation. R Internals states:
Subsetting (other than by an empty index) generally drops all attributes except names, dim and dimnames which are reset as appropriate.
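A minimal illustration of that rule on a plain named vector, as I understand it (the object names `x` and `y` are only for the example):

```r
x <- c(a = 1, b = 2, c = 3)
attr(x, "foo") <- TRUE

# Subsetting by index keeps (and resets) names but drops the custom attribute
y <- x[2:3]
print(names(y))
print(is.null(attr(y, "foo")))

# Subsetting with an empty index keeps all attributes
print(attr(x[], "foo"))
```

The column reordering in the question is the same kind of indexed subset, which is why "foo" goes missing there.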
Let's create a new attribute, foo, and see what happens in memory:
attr(dat, "foo") <- TRUE
Let's look at the internal representation:
dat_foo_sxp <- sxp(dat)
get_dat_obj_tree(dat_foo_sxp)
1 dat VECSXP length: 2 mem_addr:0x7
2 ¦--x_char STRSXP length: 10 mem_addr:0x1
3 ¦--x_int INTSXP length: 10 mem_addr:0x2
4 °--_attrib LISTSXP length: 4 mem_addr:0x3
5 ¦--names STRSXP length: 2 mem_addr:0x4
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 ¦--row.names INTSXP length: 2 mem_addr:0x6
8 °--foo LGLSXP length: 1 mem_addr:0x8
Nothing has truly changed in memory: the attribute list simply has a new node of type LGLSXP, i.e. a logical vector.
Let's re-order the columns.
new <- dat[, c(2,1)]
Although we have selected all the columns, we are essentially subsetting the data by index. Let's look at the nodes of the object in memory:
new_sxp <- sxp(new)
get_dat_obj_tree(new_sxp, "new")
1 new VECSXP length: 2 mem_addr:0x12
2 ¦--x_int INTSXP length: 10 mem_addr:0x2
3 ¦--x_char STRSXP length: 10 mem_addr:0x1
4 °--_attrib LISTSXP length: 3 mem_addr:0x9
5 ¦--names STRSXP length: 2 mem_addr:0x10
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x11
This is broadly what we would expect from a lazily-evaluated copy, apart from the row.names, which have not changed but have a new memory address:
- The integer column is the same.
- The character column is the same.
- The names have a new location (because they are re-ordered).
- The class has the same address.
- The row.names have a new memory address.
Perhaps R could have kept the row.names in the same memory location. After all, we are only subsetting columns, so the number and order of rows is unchanged.
However, and this is why my previous suggestion to pre-allocate the row names was wrong, the fact that there are new row.names does not significantly affect execution time. R is creating a new integer vector of length 2, regardless of the size of the data. This takes almost no time. It is probably not worth adding logic to the R source to establish whether the rows are the same, in order to avoid such a tiny operation.
It is notable in your example, and the answer by Joris C., that operations take longer if they include attr(new, "row.names") <- attr(dat, "row.names"), either individually or as part of a larger function call such as utils::modifyList(attributes(dat), attributes(new)). Let's try the simple way:
attr(new, "row.names") <- attr(dat, "row.names")
get_dat_obj_tree(sxp(new))
1 dat VECSXP length: 2 mem_addr:0x15
2 ¦--x_int INTSXP length: 10 mem_addr:0x2
3 ¦--x_char STRSXP length: 10 mem_addr:0x1
4 °--_attrib LISTSXP length: 3 mem_addr:0x13
5 ¦--names STRSXP length: 2 mem_addr:0x10
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x14
There's a new memory address. But the row.names attribute of new is still an integer vector of length 2. If we run dput(new) we will see row.names = c(NA, -10L).
So if we are copying an integer vector of length 2 from one place to another, regardless of the size of the data, why is it taking longer with larger data frames? The answer to this is what happens when you run:
attr(new, "row.names") <- attr(dat, "row.names")
This is syntactic sugar for:
new <- `attr<-`(new, "row.names", attr(dat, "row.names"))
Firstly, this means that we are evaluating the row.names for dat. Secondly, as R Internals notes, with a similar example, a <- `dim<-`(a, c(7, 2)):
in principle two copies of a exist for the duration of the computation
So this may be happening twice.
An easier way to understand this is by printing the right-hand side of that function call.
`attr<-`(new, "row.names", attr(dat, "row.names"))
# <truncated>
# attr(,"row.names")
# [1] 1 2 3 4 5 6 7 8 9 10
By the time the row.names are stored in new, the R source code in attrib.c is clever enough to restore them to c(NA, n) form:
INTEGER(val)[0] = NA_INTEGER;
INTEGER(val)[1] = n; // +n: compacted *and* automatic row names
However, the damage is done: the short-form c(NA, -10) row names were fully expanded, which as you would expect (and have demonstrated) takes more time for longer vectors of row names.
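To make the round trip concrete, here is a rough sketch on a tiny data frame. It assumes `.row_names_info(x, type = 0L)` returns the internal row names; the names `df`, `rhs` and `stored` are purely illustrative.

```r
df <- data.frame(x = 1:5, y = letters[1:5])
new <- df[, c(2, 1)]

# The right-hand side of the assignment is a vector with one entry per row
rhs <- attr(df, "row.names")
print(length(rhs))

# After the assignment, the stored row names are back in the two-element form
attr(new, "row.names") <- rhs
stored <- .row_names_info(new, type = 0L)
print(length(stored))
```

So the stored result is tiny, but the intermediate value handed to the assignment scales with the number of rows.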
It is possible to avoid this issue in base R, and also with the data.table and tidyverse packages.
The main point is - do not copy the row names from one data frame to another. The function suggested by Joris C. to copy any attributes that were not copied by the subset operation, rather than copying all attributes, is a good base R solution.
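As a hedged sketch of the idea of copying only what the subset dropped (a variant for illustration, not the exact function from the other answer; `copy_missing_attrs` is a hypothetical name):

```r
# Copy to `to` only those attributes of `from` that `to` does not already have.
# row.names, names and class survive the column subset, so they are skipped.
copy_missing_attrs <- function(to, from) {
  missing_nms <- setdiff(names(attributes(from)), names(attributes(to)))
  for (nm in missing_nms) {
    attr(to, nm) <- attr(from, nm, exact = TRUE)
  }
  to
}

orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
new <- copy_missing_attrs(new, orig)

print(attr(new, "foo"))
print(names(new))
```

Because row.names is present on both objects, it never appears in the set difference and is never assigned.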
An alternative is to convert the data frame to a data.table and use data.table::setattr() to set attributes by reference:
library(data.table)
orig <- data.frame(x1 = 1, x2 = 2)
setDT(orig)
mem_location <- tracemem(orig)
setattr(orig, "foo", TRUE)
tracemem(orig) == mem_location # TRUE
attr(orig, "foo") # TRUE
Additionally, with data.table you can change the column order by reference so you do not lose the attributes when you reorder the columns:
setcolorder(orig, c(2,1))
attr(orig, "foo") # TRUE
orig
# x2 x1
# 1: 2 1
Similarly, a tibble() keeps its attributes, including custom ones such as foo, when you subset columns:
library(tibble)
set.seed(0)
num_rows <- 10
dat <- tibble(
  x_char = sample(letters, num_rows),
  x_int = sample(1:10, num_rows)
)
attr(dat, "foo") <- TRUE
new <- dat[,c(2,1)]
attr(new, "foo") # TRUE
I went down several dead-ends with this one, and posted two answers that were not quite right before I understood what was really happening under the hood. But I learned a lot about R in the process. Thanks for asking such an interesting question.
Upvotes: 11
Reputation: 4698
Not a base solution but this may be useful nonetheless.
The collapse package is useful for fast transformations like this that retain attributes, similar to the data.table approach above. See a related question here.
orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE
library(collapse)
new <- collapse::colorderv(orig, c(2, 1))
attributes(new)
# $names
# [1] "x2" "x1"
#
# $class
# [1] "data.frame"
#
# $row.names
# [1] 1
#
# $foo
# [1] TRUE
With larger data this seems quicker compared to the OP approach:
orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
microbenchmark::microbenchmark(
test = { attributes(new) <- utils::modifyList(attributes(orig), attributes(new)) },
collapse_approach = {new2 <- collapse::colorderv(orig, c(2, 1))},
times = 100,
unit = "ms"
)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# test 56.264601 57.3118010 63.78885805 58.7479515 59.443000 310.330801 100 b
#collapse_approach 0.007601 0.0126515 0.03835808 0.0535015 0.055301 0.115201 100 a
Upvotes: 0
Reputation: 6234
The reason the computation time of copying all data.frame attributes scales with the size of the data.frame seems to be mainly due to the row.names attribute.
We can check that copying the row.names attribute is responsible for most of the computation time:
orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
microbenchmark::microbenchmark(
all_attrs = { attributes(new) <- attributes(orig) },
rownames = { attr(new, "row.names") <- attr(orig, "row.names") },
foo = { attr(new, "foo") <- attr(orig, "foo") },
times = 10,
unit = "ms"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> all_attrs 60.477554 61.18414 64.3562408 61.9978505 67.117645 72.827139 10
#> rownames 59.831147 61.21029 69.6012781 64.2950890 68.880676 106.280348 10
#> foo 0.001043 0.00206 0.0072771 0.0087225 0.011206 0.015295 10
If we compare this to copying the foo attribute in the case of the small data.frame, the timing is (roughly) of the same order:
orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
microbenchmark::microbenchmark(
foo = { attr(new, "foo") <- attr(orig, "foo") },
unit = "ms"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> foo 0.00115 0.00118 0.00146262 0.0012055 0.0012725 0.022368 100
To be efficient, you can choose to copy only the custom-defined attributes (instead of all data.frame attributes). For instance:
## replace only custom attributes
replace_attrs <- function(obj, new_attrs) {
  for (nm in setdiff(names(new_attrs), names(attributes(data.frame())))) {
    attr(obj, which = nm) <- new_attrs[[nm]]
  }
  return(obj)
}
new <- replace_attrs(new, attributes(orig))
Upvotes: 8