peny

Reputation: 37

Convert hourly data to daily and bi-daily in groups of elements

I know that this question is not new, but my case includes some characteristics that the previous replies cannot fully address.

I have a very big dataframe in R called 'df' (about 14 million rows) with the following format:

            ID               datetime    measurem
     1:    1459   2013-01-08 00:00:00        2.24
     2:    1459   2013-01-08 01:00:00        2
     3:    1459   2013-01-08 02:00:00        2.54
     4:    1459   2013-01-08 03:00:00        3.98
     5:    1459   2013-01-08 04:00:00        2
     6:    1459   2013-01-08 05:00:00        2
     7:    1459   2013-01-08 06:00:00        3
             ....
  1007:    2434   2013-01-08 00:00:00        3.45
  1008:    2434   2013-01-08 01:00:00        3
  1009:    2434   2013-01-08 02:00:00        4
  1010:    2434   2013-01-08 03:00:00        5.01
  1011:    2434   2013-01-08 04:00:00        4
            ....
  3245:    4780   2013-01-10 00:00:00        3
  3246:    4780   2013-01-10 01:00:00        4.73
  3247:    4780   2013-01-10 02:00:00        3

The structure of df is the following:

Classes ‘data.table’ and 'data.frame':   14103024 obs. of  3 variables:
 $ ID      : chr  "1459" "1459" ...
 $ datetime: POSIXct, format: "2013-01-08 00:00:00" "2013-01-08 01:00:00" ...
 $ measurem: num  2.24 2 2.54 ...

I would like to convert the energy data 'measurem' first to daily values by taking the sum, and then to bi-daily values (one sum for the AM hours before noon and one for the PM hours after noon), while keeping the ID column and the date. As the full dataframe is very big, I would appreciate any suggestion that works relatively fast.

Thank you in advance!

Upvotes: 0

Views: 1723

Answers (2)

Uwe

Reputation: 42544

The OP has requested any suggestion that could work relatively fast as the production data set contains 14 M rows.

Unfortunately, the accepted answer by PKumar is rather inefficient in terms of speed and memory consumption:

  • it creates a number of helper vectors which are added as new columns to df, thus being stored twice
  • each update of df copies the whole object
  • the data.table solution does not use data.table's update-by-reference syntax (:=), which would avoid the copy operations
  • POSIXlt takes 52 bytes to store one date-time instance while POSIXct only requires 8 bytes (see the quick check after this list)
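The last two points can be verified with a quick check (a sketch outside the answer's own code; exact sizes vary a little by R version and platform):

# storage size of 10,000 date-times: POSIXct vs POSIXlt
x_ct <- as.POSIXct("2013-01-08", tz = "UTC") + 3600 * (0:9999)
object.size(x_ct)              # roughly 8 bytes per value
object.size(as.POSIXlt(x_ct))  # several times larger per value

# update by reference: := adds a column without copying the whole table
library(data.table)
dt <- data.table(a = 1:3)
dt[, b := a * 2L]              # dt is modified in place, no copy is made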

My suggestion is to use data.table:

# create sample data, see function definition below
df <- create_sample_data(n_id = 4L, n_hr = 24L * 2L)
str(df)
'data.frame': 192 obs. of  3 variables:
 $ ID      : chr  "000001" "000001" "000001" "000001" ...
 $ datetime: POSIXct, format: "2013-01-08 00:00:00" "2013-01-08 01:00:00" "2013-01-08 02:00:00" ...
 $ measurem: num  1.207 0.277 1.084 2.346 0.429 ...
library(data.table)
# daily aggregates
setDT(df)[, .(sum_measurem = sum(measurem)), 
          by = .(ID, date = as.IDate(datetime))]
       ID       date sum_measurem
1: 000001 2013-01-08     18.01187
2: 000001 2013-01-09     22.53423
3: 000002 2013-01-08     21.77239
4: 000002 2013-01-09     15.57561
5: 000003 2013-01-08     14.79938
6: 000003 2013-01-09     20.09797
7: 000004 2013-01-08     15.21066
8: 000004 2013-01-09     25.47120
# bi-daily aggregates
setDT(df)[, .(sum_measurem = sum(measurem)), 
          by = .(ID, date = as.IDate(datetime), AM = hour(datetime) < 12L)]
        ID       date    AM sum_measurem
 1: 000001 2013-01-08  TRUE    10.677509
 2: 000001 2013-01-08 FALSE     7.334362
 3: 000001 2013-01-09  TRUE    12.456765
 4: 000001 2013-01-09 FALSE    10.077470
 5: 000002 2013-01-08  TRUE    12.099480
 6: 000002 2013-01-08 FALSE     9.672908
 7: 000002 2013-01-09  TRUE     8.672189
 8: 000002 2013-01-09 FALSE     6.903426
 9: 000003 2013-01-08  TRUE     8.976965
10: 000003 2013-01-08 FALSE     5.822411
11: 000003 2013-01-09  TRUE    11.131718
12: 000003 2013-01-09 FALSE     8.966252
13: 000004 2013-01-08  TRUE     8.413315
14: 000004 2013-01-08 FALSE     6.797342
15: 000004 2013-01-09  TRUE    15.111185
16: 000004 2013-01-09 FALSE    10.360017

Data

create_sample_data <- function(n_id, n_hr) {
  set.seed(1234L)
  data.frame(
    ID = rep(sprintf("%06i", seq_len(n_id)), each = n_hr),
    datetime = rep(seq(as.POSIXct("2013-01-08"), length.out = n_hr, by = "1 hour"), n_id),
    measurem = abs(rnorm(n_id * n_hr)),
    stringsAsFactors = FALSE
    )
}

Benchmark

For benchmarking, sample data are created for 100 unique IDs and 365 days of hourly data each resulting in a sample data set of 876 K rows. As some solutions modify the data set, copy() is used to provide an undisturbed data set for each run. copy() is timed as well.

df0 <- create_sample_data(n_id = 100L, n_hr = 24L * 365L)

microbenchmark::microbenchmark(
  copy = df <- copy(df0),
  uwe_dt = {
    df <- copy(df0)
    setDT(df)[, .(sum_measurem = sum(measurem)), 
              by = .(ID, date = as.IDate(datetime), AM = hour(datetime) < 12L)]
  },
  PKumar_dt = {
    df <- copy(df0)
    datetime <- as.POSIXlt(df$datetime)
    date <- as.Date(datetime)
    ind <- ifelse(datetime$hour >= 12,"PM","AM")
    df$ind <- ind
    df$date <- date
    dt <- setDT(df)
    dt[,list(sum_measure = sum(measurem)),by=list(ID,date,ind)]
  },
  PKumar_baseR = {
    df <- copy(df0)
    datetime <- as.POSIXlt(df$datetime)
    date <- as.Date(datetime)
    ind <- ifelse(datetime$hour >= 12,"PM","AM")
    df$ind <- ind
    df$date <- date
    fin <- aggregate(measurem ~ ID + date + ind, data = df, sum)
    fin[order(fin$ID),]
  },
  times = 11L
)
Unit: milliseconds
         expr        min          lq        mean      median          uq         max neval
         copy    3.94761    4.391457    5.169909    5.537982    5.864401    5.997876    11
       uwe_dt  271.89460  301.001006  339.913084  312.151541  344.251971  540.018306    11
    PKumar_dt  417.57141  464.778485  575.547756  475.562955  689.848696  851.180584    11
 PKumar_baseR 6356.93567 6707.847607 6896.174857 6863.069477 6903.442520 8112.316770    11

Even with this moderately sized problem, the base R solution is an order of magnitude slower than the data.table versions. The inefficient data manipulations in PKumar's data.table solution add about a 50% performance penalty. In addition, 56 MB of additional memory is unnecessarily allocated, while df itself only requires 17 MB.
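One way to check the memory figures is bench::mark(), which reports allocated memory next to the timings (a sketch, assuming the bench package is installed; it is not part of the benchmark above):

library(bench)
library(data.table)

object.size(df0)   # size of the input data set itself

df <- copy(df0)
bench::mark(
  setDT(df)[, .(sum_measurem = sum(measurem)),
            by = .(ID, date = as.IDate(datetime), AM = hour(datetime) < 12L)]
)   # the mem_alloc column shows the memory allocated by the expression

Running the same call on PKumar's code path would show the extra allocations caused by the POSIXlt conversion and the repeated column assignments.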

Upvotes: 1

PKumar

Reputation: 11128

If I understood you correctly, you want to summarise the "measurem" column by ID, date and AM/PM. Since there is no sample data in the question, I have made my own to illustrate the solution:

DATA:

set.seed(1234)
df <- data.frame(
  ID = rep(1:5, 4),
  datetime = c("2013-01-08 00:00:00", "2013-01-09 01:00:00", "2013-01-09 13:00:00", "2013-01-08 02:00:00", "2013-01-08 15:00:00",
               "2013-01-08 16:00:00", "2013-01-09 01:00:00", "2013-01-09 02:00:00", "2013-01-08 03:00:00", "2013-01-09 18:00:00",
               "2013-01-08 14:00:00", "2013-01-09 19:00:00", "2013-01-08 11:00:00", "2013-01-09 10:00:00", "2013-01-08 18:00:00",
               "2013-01-09 19:00:00", "2013-01-09 03:00:00", "2013-01-09 02:00:00", "2013-01-09 21:00:00", "2013-01-09 11:00:00"),
  measurement = abs(rnorm(20))
)

Solution:

datetime <- as.POSIXlt(df$datetime)
date <- as.Date(datetime)
ind <- ifelse(datetime$hour >= 12,"PM","AM")
df$ind <- ind
df$date <- date

1) data.table way:

library(data.table)
dt <- setDT(df)
dt[,list(count = .N,sum_measure = sum(measurement)),by=list(ID,date,ind)]

2) Base R way:

fin <- aggregate(measurement ~ ID + ind + date,data=df,sum)
fin[order(fin$ID),]



#  ID ind       date measurement
#  1  AM 2013-01-08  1.20706575
#  1  PM 2013-01-08  0.98324859
#  1  PM 2013-01-09  0.11028549
#  2  AM 2013-01-09  1.36317871
#  2  PM 2013-01-09  0.99838644
#  3  AM 2013-01-08  0.77625389
#  3  AM 2013-01-09  1.45782727
#  3  PM 2013-01-09  1.08444118
#  4  AM 2013-01-08  2.91014970
#  4  AM 2013-01-09  0.06445882
#  4  PM 2013-01-09  0.83717168
#  5  PM 2013-01-08  1.38861875
#  5  AM 2013-01-09  2.41583518
#  5  PM 2013-01-09  0.89003783

Upvotes: 1
