Maxim

Reputation: 281

Aggregate timeseries to length/N points

I have time series of different lengths (usually 1 to 14 days) with an interval of 15 seconds between samples. I need to keep only N points of all that data for every group, using aggregation with some predefined function (median, min, max, etc.). The reason: I want to show it on a plot, and too many points make a mess of it; it's better to split the data into short intervals and show the median, or min/max, of each.

The problem is that if I use the lubridate ceiling_date function for aggregation, I am really limited in the aggregation periods. It supports only the "N hours" or "N mins" format, not even "75m", "1500s" or "1 hour 5 mins".

But what I really need is to divide the length of my data by N and calculate the aggregation interval in seconds. Let's say my data length is 8.68 days = 8.68 * 24 * 60 * 60 = 749952 seconds. Let's say I want to have 200 points. My aggregation period should be 749952 / 200 = 3749.76 ≈ 3750 seconds. But I have to use "2 hours" instead.
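In plain R, that arithmetic is just:

days <- 8.68
total_secs <- days * 24 * 60 * 60  # 749952 seconds
N <- 200
total_secs / N                     # 3749.76, rounded up to 3750 seconds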

Here is my code with example data:

library(dplyr)
library(lubridate)

set.seed(900)

data1 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "15 sec"),
  Value = sample(1:100, 50002, replace = T),
  Instance = "C:"
)

data2 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "15 sec"),
  Value = sample(1:100, 50002, replace = T),
  Instance = "D:"
)

data <- rbind (data1, data2) %>% arrange(datetime)

data_length <-
  difftime(max(data$datetime), min(data$datetime), units = "secs")

agg_interval <- data_length / 200


if (agg_interval > 3600) {
  N_hours <- ceiling(agg_interval / 60 / 60)
  agg_period <- paste0(N_hours, " hours")
} else {
  N_minutes <- ceiling(agg_interval / 60)
  agg_period <- paste0(N_minutes, " mins")
}

agg_data <- data %>%
  group_by(across(-c(Value, datetime)),
           datetime = ceiling_date(datetime, agg_period)) %>%
  summarise(Value = median(Value), .groups = "drop")

Result:

# A tibble: 212 x 3
   Instance datetime            Value
   <chr>    <dttm>              <dbl>
 1 C:       2020-12-26 10:00:00  85  
 2 C:       2020-12-26 12:00:00  53  
 3 C:       2020-12-26 14:00:00  48.5
 4 C:       2020-12-26 16:00:00  50  
 5 C:       2020-12-26 18:00:00  52  
 6 C:       2020-12-26 20:00:00  50.5
 7 C:       2020-12-26 22:00:00  51  
 8 C:       2020-12-27 00:00:00  48  
 9 C:       2020-12-27 02:00:00  47  
10 C:       2020-12-27 04:00:00  47  
# ... with 202 more rows

The other problem: later in my code I need to convert the aggregation period back into seconds. But it's text: "15 mins", "55 mins", "3 hours", etc., which is really hard to work with.
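For example, converting such a text period back to seconds takes something like this (a sketch using lubridate's period() string parsing and period_to_seconds(); other approaches may exist):

library(lubridate)

# parse the text period and convert it to seconds
agg_period <- "2 hours"
period_to_seconds(period(agg_period))
## [1] 7200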

Is there a simpler way to do the aggregation, using seconds as an integer instead of text like I do?

Update: if I try to use an interval in seconds:

agg_interval <- round(data_length / 200 / 15) * 15

agg_period <- paste0(agg_interval, " secs")

agg_data <- data %>%
  group_by(across(-c(Value, datetime)),
           datetime = ceiling_date(datetime, agg_period)) %>%
  summarise(Value = median(Value), .groups = "drop")

The result is that no aggregation happens:

# A tibble: 50,004 x 3
   Instance datetime            Value
   <chr>    <dttm>              <int>
 1 C:       2020-12-26 10:00:00    85
 2 C:       2020-12-26 10:01:00    19
 3 C:       2020-12-26 10:02:00    43
 4 C:       2020-12-26 10:03:00    83
 5 C:       2020-12-26 10:04:00    67
 6 C:       2020-12-26 10:05:00    28
 7 C:       2020-12-26 10:06:00    54
 8 C:       2020-12-26 10:07:00    28
 9 C:       2020-12-26 10:08:00    99
10 C:       2020-12-26 10:09:00    54
# ... with 49,994 more rows

Upvotes: 0

Views: 133

Answers (2)

G. Grothendieck

Reputation: 269654

cut.POSIXt can be used like this, allowing an arbitrary number of seconds.

secs <- 7200
as.POSIXct(cut(data$datetime, paste(secs, "secs"))) + secs

Checking we have:

identical(cut(data$datetime, "7200 secs"), cut(data$datetime, "2 hours"))
## [1] TRUE

As you have undoubtedly noticed, unfortunately this does not work with ceiling_date:

identical(ceiling_date(data$datetime, "2 hours"), 
  ceiling_date(data$datetime, "7200 secs"))
## [1] FALSE

Example

secs <- 3750
agg_period <- paste(secs, "secs")

agg_data <- data %>%
    group_by(across(-c(Value, datetime)),
      datetime = as.POSIXct(cut(datetime, agg_period)) + secs) %>%
    summarise(Value = median(Value), .groups = "drop")

dim(agg_data)
## [1] 402   3
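To connect this back to the question's target of N points, secs need not be hard-coded; it can be derived from the data span (a sketch reusing the objects defined in the question):

# derive the bin width in seconds from the data span and the desired point count
N <- 200
span_secs <- as.numeric(difftime(max(data$datetime), min(data$datetime), units = "secs"))
secs <- ceiling(span_secs / N)
agg_period <- paste(secs, "secs")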

Upvotes: 1

r2evans

Reputation: 160447

Even though you're using POSIXt, nothing requires you to use "3 hours"-style sequencing; you can also specify length.out=. Here's one way.

First, create a sequence of times for each group, then nest both the new times and the original data by group.

library(dplyr)
library(tidyr)
N <- 200
newdata1 <- data %>%
  group_by(Instance) %>%
  summarize(datetime = seq(min(datetime), max(datetime), length.out = N)) %>%
  nest_by(.key = "newtimes") %>%
  ungroup()
newdata2 <- data %>%
  nest_by(Instance, .key = "olddata") %>%
  ungroup()

newdata1
# # A tibble: 2 x 2
#   Instance           newtimes
#   <chr>    <list<tbl_df[,1]>>
# 1 C:                [200 x 1]
# 2 D:                [200 x 1]
newdata2
# # A tibble: 2 x 2
#   Instance            olddata
#   <chr>    <list<tbl_df[,2]>>
# 1 C:             [50,002 x 2]
# 2 D:             [50,002 x 2]

Now we can use approx to interpolate:

newdata <- left_join(newdata1, newdata2, by = "Instance") %>%
  mutate(newdata = purrr::map2(newtimes, olddata, ~ tibble(newvalue = approx(.y$datetime, .y$Value, xout = .x$datetime)$y))) %>%
  select(-olddata) %>%
  unnest(c(newtimes, newdata))

newdata
# # A tibble: 400 x 3
#    Instance datetime            newvalue
#    <chr>    <dttm>                 <dbl>
#  1 C:       2020-12-26 10:00:00    85   
#  2 C:       2020-12-26 11:02:48     9.22
#  3 C:       2020-12-26 12:05:37    49.2 
#  4 C:       2020-12-26 13:08:26    50.8 
#  5 C:       2020-12-26 14:11:15    92.8 
#  6 C:       2020-12-26 15:14:04    48.7 
#  7 C:       2020-12-26 16:16:53    70.4 
#  8 C:       2020-12-26 17:19:42    64.5 
#  9 C:       2020-12-26 18:22:31    41.7 
# 10 C:       2020-12-26 19:25:20    73.0 
# # ... with 390 more rows
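Since the end goal is plotting, the downsampled result can be drawn directly (a sketch assuming ggplot2):

library(ggplot2)

# one line per Instance over the interpolated points
ggplot(newdata, aes(datetime, newvalue, colour = Instance)) +
  geom_line()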

Upvotes: 0
