Maxim

Reputation: 281

Aggregate timeseries to length/N points

I have time series of different lengths (usually 1 to 14 days) with an interval of 15 seconds between samples. I need to keep only N points of all that data for every group, using aggregation with some predefined function (median, min, max, etc.). The reason: I want to show it on a plot, and too many points make a mess of it; it's better to split the data into short intervals and show the median, or min/max, of each.

The problem is that if I use the lubridate ceiling_date function for aggregation, I am really limited in the aggregation periods. It supports only the "N hours" or "N mins" format, not even "75m", "1500s" or "1 hour 5 mins".

But what I really need is to divide the length of my data by N and calculate the aggregation interval in seconds. Let's say my data length is 8.68 days = 8.68 * 24 * 60 * 60 = 749952 seconds. Let's say I want to have 200 points. My aggregation period should be 749952 / 200 = 3749.76 ≈ 3750 seconds. But I have to use "2 hours" instead.
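In plain R, that arithmetic is just:

days <- 8.68
total_secs <- days * 24 * 60 * 60  # 749952 seconds
N <- 200
total_secs / N                     # 3749.76, rounded up to 3750 seconds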

Here is my code with example data:

library(dplyr)
library(lubridate)

set.seed(900)

data1 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "15 sec"),
  Value = sample(1:100, 50002, replace = T),
  Instance = "C:"
)

data2 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "15 sec"),
  Value = sample(1:100, 50002, replace = T),
  Instance = "D:"
)

data <- rbind (data1, data2) %>% arrange(datetime)

data_length <-
  difftime(max(data$datetime), min(data$datetime), units = "secs")

agg_interval <- data_length / 200


if (agg_interval > 3600) {
  N_hours <- ceiling(agg_interval / 60 / 60)
  agg_period <- paste0(N_hours, " hours")
} else {
  N_minutes <- ceiling(agg_interval / 60)
  agg_period <- paste0(N_minutes, " mins")
}

agg_data <- data %>%
  group_by(across(-c(Value, datetime)),
           datetime = ceiling_date(datetime, agg_period)) %>%
  summarise(Value = median(Value), .groups = "drop")

Result:

# A tibble: 212 x 3
   Instance datetime            Value
   <chr>    <dttm>              <dbl>
 1 C:       2020-12-26 10:00:00  85  
 2 C:       2020-12-26 12:00:00  53  
 3 C:       2020-12-26 14:00:00  48.5
 4 C:       2020-12-26 16:00:00  50  
 5 C:       2020-12-26 18:00:00  52  
 6 C:       2020-12-26 20:00:00  50.5
 7 C:       2020-12-26 22:00:00  51  
 8 C:       2020-12-27 00:00:00  48  
 9 C:       2020-12-27 02:00:00  47  
10 C:       2020-12-27 04:00:00  47  
# ... with 202 more rows

The other problem: later in my code I need to convert the aggregation period back into seconds. But it's text: "15 mins", "55 mins", "3 hours", etc., which is really hard to work with.
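For example, converting such a text period back to seconds takes something like this (a sketch using lubridate's period() string parsing and period_to_seconds(); other approaches may exist):

library(lubridate)

# parse the text period and convert it to seconds
agg_period <- "2 hours"
period_to_seconds(period(agg_period))
## [1] 7200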

Is there a simpler way to do the aggregation, using seconds as an integer instead of text like I do?

Update: if I try to use an interval in seconds:

agg_interval <- round(data_length / 200 / 15) * 15

agg_period <- paste0(agg_interval, " secs")

agg_data <- data %>%
  group_by(across(-c(Value, datetime)),
           datetime = ceiling_date(datetime, agg_period)) %>%
  summarise(Value = median(Value), .groups = "drop")

The result is that no aggregation happens:

# A tibble: 50,004 x 3
   Instance datetime            Value
   <chr>    <dttm>              <int>
 1 C:       2020-12-26 10:00:00    85
 2 C:       2020-12-26 10:01:00    19
 3 C:       2020-12-26 10:02:00    43
 4 C:       2020-12-26 10:03:00    83
 5 C:       2020-12-26 10:04:00    67
 6 C:       2020-12-26 10:05:00    28
 7 C:       2020-12-26 10:06:00    54
 8 C:       2020-12-26 10:07:00    28
 9 C:       2020-12-26 10:08:00    99
10 C:       2020-12-26 10:09:00    54
# ... with 49,994 more rows

Upvotes: 0

Views: 133

Answers (2)

G. Grothendieck

Reputation: 269654

cut.POSIXt can be used like this, allowing an arbitrary number of seconds.

secs <- 7200
as.POSIXct(cut(data$datetime, paste(secs, "secs"))) + secs

Checking we have:

identical(cut(data$datetime, "7200 secs"), cut(data$datetime, "2 hours"))
## [1] TRUE

As you have undoubtedly noticed, unfortunately this does not work with ceiling_date:

identical(ceiling_date(data$datetime, "2 hours"), 
  ceiling_date(data$datetime, "7200 secs"))
## [1] FALSE

Example

secs <- 3750
agg_period <- paste(secs, "secs")

agg_data <- data %>%
    group_by(across(-c(Value, datetime)),
      datetime = as.POSIXct(cut(datetime, agg_period)) + secs) %>%
    summarise(Value = median(Value), .groups = "drop")

dim(agg_data)
## [1] 402   3
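To connect this back to the question's target of N points, secs need not be hard-coded; it can be derived from the data span (a sketch reusing the objects defined in the question):

# derive the bin width in seconds from the data span and the desired point count
N <- 200
span_secs <- as.numeric(difftime(max(data$datetime), min(data$datetime), units = "secs"))
secs <- ceiling(span_secs / N)
agg_period <- paste(secs, "secs")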

Upvotes: 1

r2evans

Reputation: 160447

Even though you're using POSIXt, nothing requires you to use "3 hours"-style sequencing; you can also specify length.out=. Here's one way.

First, create a sequence of times for each group, then nest both the new times and the original data by group.

library(dplyr)
library(tidyr)
N <- 200
newdata1 <- data %>%
  group_by(Instance) %>%
  summarize(datetime = seq(min(datetime), max(datetime), length.out = N)) %>%
  nest_by(.key = "newtimes") %>%
  ungroup()
newdata2 <- data %>%
  nest_by(Instance, .key = "olddata") %>%
  ungroup()

newdata1
# # A tibble: 2 x 2
#   Instance           newtimes
#   <chr>    <list<tbl_df[,1]>>
# 1 C:                [200 x 1]
# 2 D:                [200 x 1]
newdata2
# # A tibble: 2 x 2
#   Instance            olddata
#   <chr>    <list<tbl_df[,2]>>
# 1 C:             [50,002 x 2]
# 2 D:             [50,002 x 2]

Now we can use approx to interpolate:

newdata <- left_join(newdata1, newdata2, by = "Instance") %>%
  mutate(newdata = purrr::map2(newtimes, olddata, ~ tibble(newvalue = approx(.y$datetime, .y$Value, xout = .x$datetime)$y))) %>%
  select(-olddata) %>%
  unnest(c(newtimes, newdata))

newdata
# # A tibble: 400 x 3
#    Instance datetime            newvalue
#    <chr>    <dttm>                 <dbl>
#  1 C:       2020-12-26 10:00:00    85   
#  2 C:       2020-12-26 11:02:48     9.22
#  3 C:       2020-12-26 12:05:37    49.2 
#  4 C:       2020-12-26 13:08:26    50.8 
#  5 C:       2020-12-26 14:11:15    92.8 
#  6 C:       2020-12-26 15:14:04    48.7 
#  7 C:       2020-12-26 16:16:53    70.4 
#  8 C:       2020-12-26 17:19:42    64.5 
#  9 C:       2020-12-26 18:22:31    41.7 
# 10 C:       2020-12-26 19:25:20    73.0 
# # ... with 390 more rows
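Since the end goal is plotting, the downsampled result can be drawn directly (a sketch assuming ggplot2):

library(ggplot2)

# one line per Instance over the interpolated points
ggplot(newdata, aes(datetime, newvalue, colour = Instance)) +
  geom_line()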

Upvotes: 0
