Reputation: 11597
I am reading Hadley's Advanced R Programming and when it discusses the memory size for characters it says this:
R has a global string pool. This means that each unique string is only stored in one place, and therefore character vectors take up less memory than you might expect.
The example the book gives is this:
library(pryr)
object_size("banana")
#> 96 B
object_size(rep("banana", 10))
#> 216 B
One of the exercises in this section is to compare these two character vectors:
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")
object_size(vec)
#> 13.4 kB
object_size(str)
#> 8.74 kB
Now, since the passage states that R has a global string pool, and since vector vec is composed mainly of repetitions of two strings ("ba" and "na"), I would intuitively expect the size of vec to be smaller than the size of str.
So my question is: how could you most accurately estimate the size of those vectors beforehand?
Upvotes: 10
Views: 1497
Reputation: 1061
The key difference comes from the pointers in vec: each of the short scalar strings (CHARSXPs) has to be pointed to from the corresponding string vector (STRSXP). You have 1326 such string pointers inside vec, but only 51 in str (a pointer is probably 8 bytes on your platform). The pool only covers the scalar strings themselves (it is also known as the CHARSXP cache), so the pointers are not shared. Another non-obvious factor is internal fragmentation: on my system, for example, a scalar string takes up the same amount of memory whether it has zero or 7 characters, the size only goes up at 8 characters, and so on. See the repeated sizes in the following:
unlist(sapply(str, object.size))
[1] 96 96 96 104 104 104 104 120 120 120 120 120 120 120 120 136 136 136 136
[20] 136 136 136 136 152 152 152 152 152 152 152 152 216 216 216 216 216 216 216
[39] 216 216 216 216 216 216 216 216 216 216 216 216 216
These are, however, implementation details of R's memory manager that could change, and one should not depend on them in any way in user programs. With another object layout or memory manager, str could use more space than vec.
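To make the pointer counts above concrete, here is a minimal sketch in base R (no pryr needed) that rebuilds the two lists from the question and counts string pointers and distinct strings. The variable names (n_ptr_vec, ptr_cost_vec and so on) are made up for illustration, and the byte figures are only a rough lower bound for the pointers alone: vector headers and the CHARSXPs themselves are deliberately ignored.

```r
# Rebuild the two lists from the question.
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")

# Each element of a character vector is one pointer to a CHARSXP.
n_ptr_vec <- sum(lengths(vec))  # 1 + 2 + ... + 51 pointers
n_ptr_str <- sum(lengths(str))  # one pointer per list element

# Pointer size on this platform (typically 8 on 64-bit builds).
ptr_bytes <- .Machine$sizeof.pointer

# Space taken by the string pointers alone (assumption: headers
# and the pooled strings themselves are not counted here).
ptr_cost_vec <- n_ptr_vec * ptr_bytes
ptr_cost_str <- n_ptr_str * ptr_bytes

# Distinct strings that actually have to live in the string pool.
n_unique_vec <- length(unique(unlist(vec)))
n_unique_str <- length(unique(unlist(str)))

print(c(vec = n_ptr_vec, str = n_ptr_str))
print(c(vec = ptr_cost_vec, str = ptr_cost_str))
print(c(vec = n_unique_vec, str = n_unique_str))
```

So vec stores only 2 distinct strings but needs 1326 pointers to them, while str stores 51 distinct strings with 51 pointers. With 8-byte pointers that is 10608 bytes of pointers for vec, which would account for most of the 13.4 kB reported in the question, subject to the same caveat about implementation details.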
Upvotes: 3