Quentin Geissmann

Reputation: 2240

Reduce memory footprint of data.table with highly repeated key

I am writing a package to analyse high throughput animal behaviour data in R. The data are multivariate time series. I have chosen to represent them using data.tables, which I find very convenient.

For one animal, I would have something like this:

one_animal_dt <- data.table(t=1:20, x=rnorm(20), y=rnorm(20))

However, my users and I work with many animals that have different treatments, conditions and other variables, each of which is constant within a given animal.

In the end, the most convenient way I found to represent the data was to merge the behaviour from all the animals and all the experiments into a single data table, and to use extra columns, which I set as key, for each of these "repeated variables".

So, conceptually, something like this:

animal_list <- list()
animal_list[[1]] <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                               treatment="A", date="2017-02-21 20:00:00", 
                               animal_id=1)
animal_list[[2]]  <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                                treatment="B", date="2017-02-21 22:00:00",
                                animal_id=2)
# ...
final_dt <- rbindlist(animal_list)
setkeyv(final_dt,c("treatment", "date","animal_id"))

This makes it very convenient to compute summaries per animal whilst remaining agnostic about the biological information (treatments and so on).

In practice, we have millions of (rather than 20) consecutive reads for each animal, so the columns we added for convenience contain highly repeated values, which is not memory efficient.

Is there a way to compress this highly redundant key without losing the structure (i.e. the columns) of the table? Ideally, I don't want to force my users to use JOINs themselves.

Upvotes: 14

Views: 3412

Answers (4)

Quentin Geissmann

Reputation: 2240

Thanks a lot for all your feedback. You encouraged me to answer my own question, so here it is.

I have considered three different data structures:

  • The original one where all the metadata are naively in the same table. See my question.
  • The nested one, where a table contains the metadata and a special data column that holds one table per animal. See the answer by @ChiPak.
  • Two tables: one for metadata and one for data. The two can be mapped onto each other using a shared id (key). See the answer by @UweBlock.

The original approach is very convenient: since data and metadata live in the same table, it is efficient and simple to write operations that involve both. For instance, creating or altering metadata or data columns can be done efficiently with :=.
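
For example, a small sketch with final_dt from the question (the new column names here are purely illustrative):

# data and metadata live in the same table, so a single := call can use both
final_dt[, x_centred := x - mean(x), by = animal_id]                   # new data column, per animal
final_dt[treatment == "A", phase := ifelse(t > 10, "late", "early")]   # condition directly on metadata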

I have investigated the nested approach and find it elegant, but I was not satisfied with how difficult and error-prone it is to write statements for simple operations such as creating a variable according to the value of a metavariable (see my comment).

I also considered the two-table option very seriously. It is very efficient, provided the users know how to perform joins (which are quite verbose) and can keep track of the relationship between data and metadata (e.g. with several datasets, you need to ensure you pair the right metadata with the right data). Ideally, metadata and data should live in the same structure, just as nested tables are "inside" their unique parent table.

In the end, I tried to take a bit of all three approaches and came up with a new data structure that I have put in a package called behavr. The data are stored internally in an instance of a class that derives from data.table, with the metadata attached as an attribute; data and metadata share the same key. For the data, it works like a regular data table, e.g. dt[, y2 := y + 1]. Since the metadata sits inside the same structure, I could write a function, xmd(), to "expand metadata" and join it implicitly. For instance, dt[, y3 := xmd(a_meta_variable) + y] looks up a_meta_variable in the metadata and joins it so it can be used as a regular -- anonymous -- vector. In addition, I altered the [ operator a little so that the metadata can be accessed with [..., meta=T]: dt[meta=T] shows the metadata, dt[, new_meta_var := meta_var1 + meta_var2, meta=T] creates a new metavariable, and dt[id < 10, meta=T] subsets it.
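
As a concrete illustration of the syntax described above, here is a minimal sketch; the toy column names (id, treatment, baseline) are made up for the example and the constructor behavr() is assumed from the package documentation, so treat the details as indicative rather than authoritative:

library(data.table)
library(behavr)   # draft package: https://github.com/rethomics/behavr

# metadata: one row per animal; data: many rows per animal; both keyed by "id"
met <- data.table(id = 1:2, treatment = c("A", "B"), baseline = c(0.1, 0.2), key = "id")
dat <- data.table(id = rep(1:2, each = 20), t = rep(1:20, 2),
                  x = rnorm(40), y = rnorm(40), key = "id")
dt <- behavr(dat, met)

dt[, y2 := y + 1]                              # works like a regular data.table
dt[, y3 := xmd(baseline) + y]                  # xmd() expands a metavariable to one value per data row
dt[meta = TRUE]                                # view the metadata
dt[, short := treatment == "A", meta = TRUE]   # create a new metavariable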

It really is a draft package at the moment, so I would be very happy to have some feedback and contributions! Read more at https://github.com/rethomics/behavr

Upvotes: 3

Mallick Hossain

Reputation: 661

Here are two possibilities (use one, both, or none):

  1. Ensure that all column types are as memory efficient as possible. If you have integers stored as numerics, that can eat up a lot of memory (see the sketch just after this list).
  2. Since you don't want your users to have to do joins on their own, write a short function that does the join for them depending on the animals they want. Just put the animal information in one data.table and the treatment information in another data.table, and do the merge inside the function.
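
To illustrate point 1 with the question's example data (where animal_id is created as a numeric 1, 2, ...), a minimal sketch:

# check how each column is currently stored
sapply(final_dt, class)

# whole numbers stored as numeric use 8 bytes per value; as integer, only 4
final_dt[, animal_id := as.integer(animal_id)]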

First I'll just separate the tables:

# Same code as in question to generate data
animal_list <- list()
animal_list[[1]] <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                               treatment="A", date="2017-02-21 20:00:00", 
                               animal_id=1)
animal_list[[2]]  <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                                treatment="B", date="2017-02-21 22:00:00",
                                animal_id=2)
# ...
final_dt <- rbindlist(animal_list)

# Separating into animal and treatment data.tables, both keyed by animal_id
animals_dt    <- unique(final_dt[, .(date), keyby = animal_id])
treatments_dt <- final_dt[, .(t, x, y, treatment), keyby = animal_id]

Then here's the function that does the merging for users

get_animals <- function(animal_names) {
  output <- animals_dt[animal_id %in% animal_names]  # getting desired animals
  output <- treatments_dt[output]                    # merging in treatment details
  return(output)
}
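
A user who wants, say, animals 1 and 2 then simply calls:

get_animals(c(1, 2))   # observations plus treatment and date for animals 1 and 2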

Edited to use animal_id as the unique identifier instead of treatment. h/t Uwe

Upvotes: 2

Uwe

Reputation: 42544

Let's assume we are a database administrator given the task of implementing this efficiently in a SQL database. One of the goals of database normalisation is the reduction of redundancy.

According to the OP's description, there are many (about 1 M) observations per animal (multivariate, longitudinal data), while the number of animals seems to be much smaller.

So, the constant (or invariant) base data of each animal, e.g., treatment, date, should be kept separately from the observations.

animal_id is the key into both tables assuming animal_id is unique (as the name suggests).

(Note that this is the main difference to Mallick's answer who uses treatment as key which is not guaranteed to be unique, i.e., two animals may receive the same treatment, and furthermore increases redundancy.)

Separate tables are memory efficient

For the purpose of demonstration, more realistic "benchmark" data are being created for 10 animals with 1 M observations for each animal:

library(data.table)   # CRAN version 1.10.4 used
# create observations
n_obs <- 1E6L
n_animals <-10L
set.seed(123L)
observations <- data.table(
  animal_id = rep(seq_len(n_animals), each = n_obs),
  t = rep(seq_len(n_obs), n_animals),
  x = rnorm(n_animals * n_obs), 
  y = rnorm(n_animals * n_obs))
# create animal base data
animals = data.table(
  animal_id = seq_len(n_animals),
  treatment = wakefield::string(n_animals),
  date = wakefield::date_stamp(n_animals, random = TRUE))

Here the wakefield package is used to create dummy names and dates. Note that animal_id is of type integer.

> str(observations)
Classes ‘data.table’ and 'data.frame':    10000000 obs. of  4 variables:
 $ animal_id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ t        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ x        : num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
 $ y        : num  0.696 -0.537 -3.043 1.849 -1.085 ...
 - attr(*, ".internal.selfref")=<externalptr> 
> str(animals)
Classes ‘data.table’ and 'data.frame':    10 obs. of  3 variables:
 $ animal_id: int  1 2 3 4 5 6 7 8 9 10
 $ treatment:Classes 'variable', 'character'  atomic [1:10] MADxZ9c6fN ymoJHnvrRx ifdtywJ4jU Q7ZRwnQCsU ...
  .. ..- attr(*, "varname")= chr "String"
 $ date     : variable, format: "2017-07-02" "2016-10-02" ...
 - attr(*, ".internal.selfref")=<externalptr>

The combined size is about 240 Mbytes:

> object.size(observations)
240001568 bytes
> object.size(animals)
3280 bytes

Let's take this as a reference and compare it with the OP's approach, final_dt:

# join both tables to create equivalent of final_dt
joined <- animals[observations, on = "animal_id"]

The size has now nearly doubled (400 Mbytes), which is not memory efficient.

> object.size(joined)
400003432 bytes

Note that no data.table key has been set so far; instead, the on parameter was used to specify the column to join on. If we set the keys, joins will be sped up and the on parameter can be omitted:

setkey(observations, animal_id)
setkey(animals, animal_id)
joined <- animals[observations] 

How to work with separate tables?

Now, we have demonstrated that it is memory efficient to use two separate tables.

For subsequent analysis, we can aggregate the observations per animal, e.g.,

observations[, .(.N, mean(x), mean(y)), by = animal_id]
    animal_id       N            V2            V3
 1:         1 1000000 -5.214370e-04 -0.0019643145
 2:         2 1000000 -1.555513e-03  0.0002489457
 3:         3 1000000  1.541233e-06 -0.0005317967
 4:         4 1000000  1.775802e-04  0.0016212182
 5:         5 1000000 -9.026074e-04  0.0015266330
 6:         6 1000000 -1.000892e-03  0.0003284044
 7:         7 1000000  1.770055e-04 -0.0018654386
 8:         8 1000000  1.919562e-03  0.0008605261
 9:         9 1000000  1.175696e-03  0.0005042170
10:        10 1000000  1.681614e-03  0.0020562628

and join the aggregates with animals

animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]]
    animal_id  treatment       date       N            V2            V3
 1:         1 MADxZ9c6fN 2017-07-02 1000000 -5.214370e-04 -0.0019643145
 2:         2 ymoJHnvrRx 2016-10-02 1000000 -1.555513e-03  0.0002489457
 3:         3 ifdtywJ4jU 2016-10-02 1000000  1.541233e-06 -0.0005317967
 4:         4 Q7ZRwnQCsU 2017-02-02 1000000  1.775802e-04  0.0016212182
 5:         5 H2M4V9Dfxz 2017-04-02 1000000 -9.026074e-04  0.0015266330
 6:         6 29P3hFxqNY 2017-03-02 1000000 -1.000892e-03  0.0003284044
 7:         7 rBxjewyGML 2017-02-02 1000000  1.770055e-04 -0.0018654386
 8:         8 gQP8cZhcTT 2017-04-02 1000000  1.919562e-03  0.0008605261
 9:         9 0GEOseSshh 2017-07-02 1000000  1.175696e-03  0.0005042170
10:        10 x74yDs2MdT 2017-02-02 1000000  1.681614e-03  0.0020562628

The OP has pointed out that he doesn't want to force his users to use joins themselves. Admittedly, typing animals[observations] takes more keystrokes than final_dt. So it's up to the OP to decide whether the memory savings are worth it.

This result can be filtered, for instance, if we want to compare animals with certain characteristics, e.g.,

animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]][date == as.Date("2017-07-02")]
   animal_id  treatment       date       N           V2           V3
1:         1 MADxZ9c6fN 2017-07-02 1000000 -0.000521437 -0.001964315
2:         9 0GEOseSshh 2017-07-02 1000000  0.001175696  0.000504217

OP's use cases

In a comment, the OP has described some use cases which he wants to see implemented transparently for his users:

  • Creation of new columns, final_dt[, x2 := 1-x]: as only observations are involved, this translates directly to observations[, x2 := 1-x].
  • Selection using various criteria, final_dt[t > 5 & treatment == "A"]: here columns of both tables are involved. This can be implemented with data.table in different ways (note that the conditions have been amended for the actual sample data):

    animals[observations][t < 5L & treatment %like% "MAD"]
    

    This is analogous to the expected syntax but is slower than the alternative below, because here the filter conditions are applied to all rows of the full join.

    The faster alternative is to split up the filter conditions: observations is filtered before the join to reduce the intermediate result, and the filter conditions on the base data columns are applied last:

    animals[observations[t < 5L]][treatment %like% "MAD"]
    

    Note that this looks quite similar to the expected syntax (with one keystroke less).

    If this is deemed unacceptable by the users, the join operation can be hidden in a function:

    # function definition
    filter_dt <- function(ani_filter = "", obs_filter = "") {
      eval(parse(text = stringr::str_interp(
        'animals[observations[${obs_filter}]][${ani_filter}]')))
    }
    
    # called by user
    filter_dt("treatment %like% 'MAD'", "t < 5L")
    
       animal_id  treatment       date t           x          y
    1:         1 MADxZ9c6fN 2017-07-02 1 -0.56047565  0.6958622
    2:         1 MADxZ9c6fN 2017-07-02 2 -0.23017749 -0.5373377
    3:         1 MADxZ9c6fN 2017-07-02 3  1.55870831 -3.0425688
    4:         1 MADxZ9c6fN 2017-07-02 4  0.07050839  1.8488057
    

Using factors to reduce memory footprint

Caveat: Your mileage may vary, as the conclusions below depend on the internal representation of integers on your computer and on the cardinality of the data. Please see Matt Dowle's excellent answer concerning this subject.

Mallick has mentioned that memory might be wasted if integers are incidentally stored as numerics. This can be demonstrated:

n <- 10000L
# integer vs numeric vs logical
test_obj_size <- data.table(
  rep(1, n),
  rep(1L, n),
  rep(TRUE, n))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  3 variables:
 $ V1: num  1 1 1 1 1 1 1 1 1 1 ...
 $ V2: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V3: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3 
80040 40040 40040

Note that the numeric vector needs twice as much memory as the integer vector. Therefore, it is good programming practice to always qualify an integer constant with the suffix character L.

Also the memory consumption of character strings can be reduced if they are coerced to factor:

# character vs factor
test_obj_size <- data.table(
  rep("A", n),
  rep("AAAAAAAAAAA", n),
  rep_len(LETTERS, n),
  factor(rep("A", n)),
  factor(rep("AAAAAAAAAAA", n)),
  factor(rep_len(LETTERS, n)))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  6 variables:
 $ V1: chr  "A" "A" "A" "A" ...
 $ V2: chr  "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" ...
 $ V3: chr  "A" "B" "C" "D" ...
 $ V4: Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
 $ V5: Factor w/ 1 level "AAAAAAAAAAA": 1 1 1 1 1 1 1 1 1 1 ...
 $ V6: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4    V5    V6 
80088 80096 81288 40456 40464 41856

Stored as factor, only half of the memory is required.

The same holds for Date and POSIXct classes:

# Date & POSIXct vs factor
test_obj_size <- data.table(
  rep(as.Date(Sys.time()), n),
  rep(as.POSIXct(Sys.time()), n),
  factor(rep(as.Date(Sys.time()), n)),
  factor(rep(as.POSIXct(Sys.time()), n)))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  4 variables:
 $ V1: Date, format: "2017-08-02" "2017-08-02" "2017-08-02" "2017-08-02" ...
 $ V2: POSIXct, format: "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" ...
 $ V3: Factor w/ 1 level "2017-08-02": 1 1 1 1 1 1 1 1 1 1 ...
 $ V4: Factor w/ 1 level "2017-08-02 18:25:55": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4 
80248 80304 40464 40480

Note that data.table() refuses to create a column of class POSIXlt, as POSIXlt is stored in 40 bytes instead of 8 bytes.

So, if your application is memory critical, it might be worthwhile to consider using factor where applicable.
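
Applied to the question's final_dt, a minimal sketch of such a conversion (the in-place := / .SDcols idiom is standard data.table; whether it pays off depends on how repetitive your character columns actually are):

# convert the highly repetitive character columns (here treatment and date) to factor, in place
chr_cols <- names(final_dt)[sapply(final_dt, is.character)]
final_dt[, (chr_cols) := lapply(.SD, as.factor), .SDcols = chr_cols]
sapply(final_dt, object.size)   # compare per-column sizes before and after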

Upvotes: 5

CPak

Reputation: 13581

You should consider using a nested data.frame

library(tidyverse)

Using a toy example where I rbind 4 copies of mtcars

new <- rbind(mtcars,mtcars,mtcars,mtcars) %>% 
         select(cyl,mpg)
object.size(new)
11384 bytes

If we group the data, which you might do for summarizing values, the size increases a bit

grp <- rbind(mtcars,mtcars,mtcars,mtcars)%>% 
         select(cyl,mpg) %>% 
         group_by(cyl)
object.size(grp)    
14272 bytes

If we nest the data as well

alt <- rbind(mtcars,mtcars,mtcars,mtcars) %>% 
         select(cyl,mpg) %>% 
         group_by(cyl) %>% 
         nest(mpg)
object.size(alt)
4360 bytes

You get a significant reduction in the object size.

NOTE: You must have many repeating values to save memory in this case; for instance, a single nested copy of mtcars is larger in memory than a single normal copy of mtcars.

-----YOUR CASE-----

alt1 <- final_dt %>%
         group_by(animal_id, treatment, date) %>%
         nest()

would look like

alt1
  animal_id treatment                date              data
1         1         A 2017-02-21 20:00:00 <tibble [20 x 3]>
2         2         B 2017-02-21 22:00:00 <tibble [20 x 3]>

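To address the concern about working with such a structure, per-animal summaries can be computed without manual joins by mapping over the nested data column, and the long format can be recovered with unnest(); here is a sketch using the pre-tidyr-1.0 syntax of this answer:

library(tidyverse)   # dplyr, purrr and tidyr are already attached above

alt1 %>%
  mutate(mean_x = map_dbl(data, ~ mean(.x$x)),
         mean_y = map_dbl(data, ~ mean(.x$y)))

alt1 %>% unnest()    # back to one row per observation
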
Upvotes: 3
