Reputation: 281
I have time series of different lengths (usually 1 to 14 days) with an interval of 15 seconds between samples. I need to keep only N points of all that data for every group, using aggregation with some predefined function (median, min, max, etc.). The reason: I want to show the data on a plot, and too many points make a mess of it; it is better to split the data into short intervals and show the median or min/max of each.
The problem is that if I use the lubridate ceiling_date function for aggregation, I am really limited in aggregation periods. It supports only the "N hours" or "N mins" format, not "75m", "1500s", or "1 hour 5 mins".
But what I really need is to divide the length of my data by N and calculate the aggregation interval in seconds. Let's say my data length is 8.68 days = 8.68*24*60*60 = 749952 seconds, and I want to have 200 points. My aggregation period should then be 749952/200 = 3749.76 ~ 3750 seconds, but I have to use "2 hours" instead.
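As a quick check of that arithmetic, in plain R:
days <- 8.68
N <- 200
total_secs <- days * 24 * 60 * 60   # 749952 seconds of data
agg_secs <- total_secs / N          # 3749.76 seconds per bin
round(agg_secs / 15) * 15           # 3750, snapped to the 15-second sampling grid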
Here is my code with example data:
library(dplyr)
library(lubridate)

set.seed(900)
data1 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"),
                        as.POSIXct("2020-12-26 10:00:00") + 15 * 50001, "15 sec"),
  Value = sample(1:100, 50002, replace = TRUE),
  Instance = "C:"
)
data2 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"),
                        as.POSIXct("2020-12-26 10:00:00") + 15 * 50001, "15 sec"),
  Value = sample(1:100, 50002, replace = TRUE),
  Instance = "D:"
)
data <- rbind(data1, data2) %>% arrange(datetime)

data_length <- difftime(max(data$datetime), min(data$datetime), units = "secs")
agg_interval <- data_length / 200

if (agg_interval > 3600) {
  N_hours <- ceiling(agg_interval / 60 / 60)
  agg_period <- paste0(N_hours, " hours")
} else {
  N_minutes <- ceiling(agg_interval / 60)
  agg_period <- paste0(N_minutes, " mins")
}

agg_data <- data %>%
  group_by(across(-c(Value, datetime)),
           datetime = ceiling_date(datetime, agg_period)) %>%
  summarise(Value = median(Value), .groups = "drop")
Result:
# A tibble: 212 x 3
Instance datetime Value
<chr> <dttm> <dbl>
1 C: 2020-12-26 10:00:00 85
2 C: 2020-12-26 12:00:00 53
3 C: 2020-12-26 14:00:00 48.5
4 C: 2020-12-26 16:00:00 50
5 C: 2020-12-26 18:00:00 52
6 C: 2020-12-26 20:00:00 50.5
7 C: 2020-12-26 22:00:00 51
8 C: 2020-12-27 00:00:00 48
9 C: 2020-12-27 02:00:00 47
10 C: 2020-12-27 04:00:00 47
# ... with 202 more rows
The other problem: later in my code I need to convert the aggregation period into seconds, but it's text ("15 mins", "55 mins", "3 hours", etc.), which is really awkward to work with.
Is there a simpler way to do this aggregation, using the interval in seconds as an integer rather than as text like I do?
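(For what it's worth, lubridate itself can convert such period strings back to a number of seconds, which may help with that part. A sketch, assuming period() parses strings like "15 mins":
library(lubridate)
# Parse the period text, then convert the period to seconds
period_to_seconds(period("15 mins"))   # 900
period_to_seconds(period("3 hours"))   # 10800
)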
Upd: if I try to use an interval in seconds:
agg_interval <- round(data_length / 200 / 15) * 15
agg_period <- paste0(agg_interval, " secs")
agg_data <- data %>%
  group_by(across(-c(Value, datetime)),
           datetime = ceiling_date(datetime, agg_period)) %>%
  summarise(Value = median(Value), .groups = "drop")
The result is that no aggregation happens:
# A tibble: 50,004 x 3
Instance datetime Value
<chr> <dttm> <int>
1 C: 2020-12-26 10:00:00 85
2 C: 2020-12-26 10:01:00 19
3 C: 2020-12-26 10:02:00 43
4 C: 2020-12-26 10:03:00 83
5 C: 2020-12-26 10:04:00 67
6 C: 2020-12-26 10:05:00 28
7 C: 2020-12-26 10:06:00 54
8 C: 2020-12-26 10:07:00 28
9 C: 2020-12-26 10:08:00 99
10 C: 2020-12-26 10:09:00 54
# ... with 49,994 more rows
Upvotes: 0
Views: 133
Reputation: 269654
cut.POSIXt can be used like this, allowing an arbitrary number of seconds:
secs <- 7200
as.POSIXct(cut(data$datetime, paste(secs, "secs"))) + secs
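(cut() labels each timestamp with the left edge of its bin; converting the factor back to POSIXct and adding secs moves each stamp to the bin's right edge, mimicking the ceiling behaviour of ceiling_date().)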
Checking, we have:
identical(cut(data$datetime, "7200 secs"), cut(data$datetime, "2 hours"))
## [1] TRUE
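(The two agree here because the series starts exactly on the hour: cut() starts second-based breaks at the first timestamp, while hour-based breaks are aligned to the clock hour.)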
As you have undoubtedly noticed, unfortunately this does not work with ceiling_date:
identical(ceiling_date(data$datetime, "2 hours"),
          ceiling_date(data$datetime, "7200 secs"))
## [1] FALSE
secs <- 3750
agg_period <- paste(secs, "secs")
agg_data <- data %>%
  group_by(across(-c(Value, datetime)),
           datetime = as.POSIXct(cut(datetime, agg_period)) + secs) %>%
  summarise(Value = median(Value), .groups = "drop")
dim(agg_data)
## [1] 402   3
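If this pattern recurs, it can be wrapped in a small convenience function (agg_by_secs is a hypothetical name, assuming the datetime/Value column layout from the question):
# Hypothetical wrapper around the cut()-based binning above; the aggregation
# interval stays an integer number of seconds throughout.
agg_by_secs <- function(df, secs, fun = median) {
  df %>%
    group_by(across(-c(Value, datetime)),
             datetime = as.POSIXct(cut(datetime, paste(secs, "secs"))) + secs) %>%
    summarise(Value = fun(Value), .groups = "drop")
}
agg_data <- agg_by_secs(data, 3750)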
Upvotes: 1
Reputation: 160447
Even though you're using POSIXt, nothing requires you to use "3 hours"-style sequencing; you can also specify length.out=. Here's one way.
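As a quick illustration of length.out= on date-times (endpoint values made up for the demo):
# seq() on POSIXct accepts length.out=, returning exactly N equally spaced stamps
seq(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 12:00:00"),
    length.out = 5)
## 10:00:00, 10:30:00, 11:00:00, 11:30:00, 12:00:00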
First, create a range of times for each group, then nest both the new times and the original data.
library(dplyr)
library(tidyr)   # needed for unnest() below

N <- 200
newdata1 <- data %>%
  group_by(Instance) %>%
  summarize(datetime = seq(min(datetime), max(datetime), length.out = N)) %>%
  nest_by(.key = "newtimes") %>%
  ungroup()
newdata2 <- data %>%
  nest_by(Instance, .key = "olddata") %>%
  ungroup()
newdata1
# # A tibble: 2 x 2
# Instance newtimes
# <chr> <list<tbl_df[,1]>>
# 1 C: [200 x 1]
# 2 D: [200 x 1]
newdata2
# # A tibble: 2 x 2
# Instance olddata
# <chr> <list<tbl_df[,2]>>
# 1 C: [50,002 x 2]
# 2 D: [50,002 x 2]
Now we can use approx to interpolate:
newdata <- left_join(newdata1, newdata2, by = "Instance") %>%
  mutate(newdata = purrr::map2(newtimes, olddata,
                               ~ tibble(newvalue = approx(.y$datetime, .y$Value, xout = .x$datetime)$y))) %>%
  select(-olddata) %>%
  unnest(c(newtimes, newdata))
newdata
# # A tibble: 400 x 3
# Instance datetime newvalue
# <chr> <dttm> <dbl>
# 1 C: 2020-12-26 10:00:00 85
# 2 C: 2020-12-26 11:02:48 9.22
# 3 C: 2020-12-26 12:05:37 49.2
# 4 C: 2020-12-26 13:08:26 50.8
# 5 C: 2020-12-26 14:11:15 92.8
# 6 C: 2020-12-26 15:14:04 48.7
# 7 C: 2020-12-26 16:16:53 70.4
# 8 C: 2020-12-26 17:19:42 64.5
# 9 C: 2020-12-26 18:22:31 41.7
# 10 C: 2020-12-26 19:25:20 73.0
# # ... with 390 more rows
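Note that approx() linearly interpolates between samples rather than summarising them, so this produces a resampled series, not a per-window median/min/max. If a true per-bin median over the same N-point grid is wanted, one variant (a sketch, reusing data and N from above) is:
# Bin each group into N equal-width windows and take the median per window
agg_median <- data %>%
  group_by(Instance) %>%
  mutate(bin = cut(datetime,
                   breaks = seq(min(datetime), max(datetime), length.out = N + 1),
                   include.lowest = TRUE)) %>%
  group_by(Instance, datetime = as.POSIXct(as.character(bin))) %>%
  summarise(Value = median(Value), .groups = "drop")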
Upvotes: 0