Cauder

Reputation: 2597

Is there a fast way to turn a data table into a list of data tables?

I have a data table with this schema:

date, key_a, key_b, key_c, key_d, value

I'd like to make a list of lists that has this structure:

[1]
date, key_a, value

[2] 
date, key_b, value

[3]
date, key_c, value

I'd like to aggregate my dt into a list where each entry aggregates the dt by a different one of the keys.

This is my code:

setDT(dt)

list_of_dts <- list()

list_of_dts[[1]] <-
    dt[, .(value = sum(value)), .(date, key_a)]

list_of_dts[[2]] <-
    dt[, .(value = sum(value)), .(date, key_b)]

(and so on)

Is there a more efficient way of solving this?

Upvotes: 2

Views: 245

Answers (3)

akrun

Reputation: 887251

We can use the tidyverse:

library(dplyr)
library(tidyr)
dt %>%
    pivot_longer(cols = starts_with('key'), values_to = 'value1') %>%
    group_by(name, date, value1) %>%
    summarise(value = sum(value))

An option is to convert to a disk.frame and do the group-by operations there:

library(disk.frame)
dt %>%
    pivot_longer(cols = starts_with('key'), values_to = 'value1') %>%
    as.disk.frame() %>%
    group_by(name, date, value1) %>%
    summarise(value = sum(value)) %>%
    collect()

When there are multiple .csv files, they can be read directly with csv_to_disk.frame:

df <- csv_to_disk.frame(file.path(tempdir(), "df.csv"),
    inmapfn = function(chunk) {
        # convert date_str to Date format and store as "date"
        chunk[, date := as.Date(date_str, "%Y-%m-%d")]
        chunk[, date_str := NULL]
        chunk[, new := col1 + 5]
    })

Upvotes: 1

r2evans

Reputation: 160447

Try this data.table-native attempt:

dt <- data.table(
  date  = c(1, 1, 2),
  key_a = c(11, 11, 13),
  key_b = c(21, 21, 23),
  key_c = c(31, 31, 33),
  key_d = c(41, 41, 43),
  value = c(51, 51, 53)
)
keynames <- grep("^key", colnames(dt), value = TRUE)
othnames <- setdiff(colnames(dt), keynames)
keynames
# [1] "key_a" "key_b" "key_c" "key_d"
othnames
# [1] "date"  "value"

The split, for raw data:

lapply(setNames(nm = keynames), function(kn) subset(dt, select = c(othnames, kn)))
# $key_a
#    date value key_a
# 1:    1    51    11
# 2:    1    51    11
# 3:    2    53    13
# $key_b
#    date value key_b
# 1:    1    51    21
# 2:    1    51    21
# 3:    2    53    23
# $key_c
#    date value key_c
# 1:    1    51    31
# 2:    1    51    31
# 3:    2    53    33
# $key_d
#    date value key_d
# 1:    1    51    41
# 2:    1    51    41
# 3:    2    53    43

Or for your aggregation issue:

lapply(setNames(nm = keynames), function(kn) dt[, .(value = sum(value)), by = c("date", kn)])
# $key_a
#    date key_a value
# 1:    1    11   102
# 2:    2    13    53
# $key_b
#    date key_b value
# 1:    1    21   102
# 2:    2    23    53
# $key_c
#    date key_c value
# 1:    1    31   102
# 2:    2    33    53
# $key_d
#    date key_d value
# 1:    1    41   102
# 2:    2    43    53

Upvotes: 1

Ronak Shah

Reputation: 389055

Maybe get the data in long format and then summarise it:

library(data.table)
setDT(dt)
dt1 <- melt(dt, id.vars = c('date', 'value'), value.name = 'key_value')
dt1 <- dt1[, .(value = sum(value)), .(date, variable, key_value)]

Now if you want a list of data tables, you can use split:

split(dt1, dt1$variable)
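As a self-contained illustration of this melt/aggregate/split pipeline (the sample data and its values are assumed here, following the question's schema, and value.name is set explicitly to avoid a clash with the existing value column):

```r
library(data.table)

# Illustrative data following the question's schema (values assumed)
dt <- data.table(
  date  = c(1, 1, 2),
  key_a = c(11, 11, 13),
  key_b = c(21, 21, 23),
  value = c(51, 51, 53)
)

# Long format: one row per (date, value, key column, key value)
long <- melt(dt, id.vars = c("date", "value"), value.name = "key_value")

# Aggregate within each key column, mirroring
# dt[, .(value = sum(value)), .(date, key_x)] for each key
agg <- long[, .(value = sum(value)), by = .(variable, date, key_value)]

# One data.table per key column
list_of_dts <- split(agg, agg$variable)
```

split.data.table keeps each list element a data.table, and names(list_of_dts) are the key column names, so list_of_dts$key_a holds the aggregation by date and key_a.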

Upvotes: 3
