Reputation: 1294
A colleague of mine is interested in how much overhead (in terms of memory) there is in an R data.frame. He uses the following example:
n = 1e6
df = data.frame(v1=rnorm(n), v2 = rnorm(n))
object.size(df)-sum(sapply(df, object.size)) # some overhead in the df
#> 696 bytes
All well and good. Now, let's take a random subset of this data.frame:
idx = as.logical(rbinom(n, size = 1, prob = 0.99))
df0 = df[idx,]
So the overhead in df0 should be the same as in df, right?
object.size(df0) - sum(sapply(df0, object.size))
#> 3961136 bytes
To quote the Grinch: "Wrongo!" It seems to me that information about the subsetting variable is being stored here, because if I change this to:
object.size(df0) - sum(sapply(df0, object.size)) - object.size(idx[idx])
then I get 648 bytes, which is almost right. However, I cannot see where this information is being stored.
Upvotes: 2
Views: 99
Reputation: 132596
It's the row names. The original data.frame has implicit (compactly stored) row names, while the subset df0 stores each surviving row name explicitly.
.row_names_info(df, type = 0)
#[1] NA -1000000
.row_names_info(df0, type = 0)
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
#[28] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 54 55 56
#[55] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
#<snip>
object.size(.row_names_info(df0, type = 0))
# 3960008 bytes
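If the original row numbers aren't needed, resetting the row names switches df0 back to the compact representation and the extra memory disappears. A sketch, reusing the question's setup (variable names follow the question; exact byte counts will vary slightly by R version):

```r
# Reproduce the question's setup
n <- 1e6
df <- data.frame(v1 = rnorm(n), v2 = rnorm(n))
idx <- as.logical(rbinom(n, size = 1, prob = 0.99))
df0 <- df[idx, ]

# Assigning NULL row names makes R store them compactly again,
# as the internal c(NA, -nrow) form rather than an integer vector
rownames(df0) <- NULL
.row_names_info(df0, type = 0)
# back to the compact form, e.g. c(NA, -989922)

# The overhead drops back to a few hundred bytes
object.size(df0) - sum(sapply(df0, object.size))
```

This is also why tools like data.table and tibble avoid carrying explicit row names: on large subsets they can cost as much as a full integer column.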
Upvotes: 1