yournamehere

Reputation: 53

Filtering with variable time interval using dplyr

I want to filter my time series based on a variable time interval. More specifically, let t_i be the time of day of a timestamp t. I want to filter my time series so that only rows whose time of day lies between t_i - 15 min and t_i + 15 min (both inclusive) remain, for each day in a moving window of mv days ending at t.

Here's what I tried:

library(lubridate)
library(dplyr)

mv <- 2 # moving window
t <- as.POSIXct("2020-06-20 12:00", tz="UTC") # time stamp
time <- seq(ymd_hm('2020-01-01 00:00'),ymd_hm('2020-12-31 23:45'), by = '15 mins')
ts <- tibble(time=time, data=sin(seq(1,length(time),1)))

# What I did:
ts %>%
  filter(time >= t - mv*24*60*60) %>%
  filter(time <= t) %>%
  filter(strftime(time, format = "%H:%M", tz = "UTC") >= strftime(t-15*60, format = "%H:%M", tz = "UTC")) %>%
  filter(strftime(time, format = "%H:%M", tz = "UTC") <= strftime(t+15*60, format = "%H:%M", tz = "UTC"))

Output:

# A tibble: 7 x 2
  time                   data
  <dttm>                <dbl>
1 2020-06-18 12:00:00 -0.435 
2 2020-06-18 12:15:00  0.523 
3 2020-06-19 11:45:00  0.298 
4 2020-06-19 12:00:00  0.964 
5 2020-06-19 12:15:00  0.744 
6 2020-06-20 11:45:00  0.885 
7 2020-06-20 12:00:00  0.0870

This is exactly what I want but it breaks down when t <- as.POSIXct("2020-06-20 23:45", tz="UTC") (also with 00:00):

# A tibble: 0 x 2
# … with 2 variables: time <dttm>, data <dbl>

I included an if-else statement to circumvent this but it is far from elegant and doesn't give me exactly what I want:

t <- as.POSIXct("2020-06-20 23:45", tz="UTC") # time stamp
if(strftime(t, format = "%H:%M", tz = "UTC") %in% c("23:45","00:00")){
  ts %>% 
    filter(time >= t - mv*24*60*60) %>%
    filter(time <= t) %>%
    filter(strftime(time, format = "%H:%M", tz = "UTC") >= strftime(t-15*60, format = "%H:%M", tz = "UTC"))
} else {
  ts %>% 
    filter(time >= t - mv*24*60*60) %>%
    filter(time <= t) %>%
    filter(strftime(time, format = "%H:%M", tz = "UTC") >= strftime(t-15*60, format = "%H:%M", tz = "UTC")) %>%
    filter(strftime(time, format = "%H:%M", tz = "UTC") <= strftime(t+15*60, format = "%H:%M", tz = "UTC"))
}

Output:
# A tibble: 5 x 2
  time                  data
  <dttm>               <dbl>
1 2020-06-18 23:45:00  0.543
2 2020-06-19 23:30:00 -0.177
3 2020-06-19 23:45:00 -0.924
4 2020-06-20 23:30:00 -0.936
5 2020-06-20 23:45:00 -0.209

Desired output:
# A tibble: 7 x 2
  time                  data
  <dttm>               <dbl>
1 2020-06-18 23:45:00  0.543
2 2020-06-19 00:00:00 -0.413
3 2020-06-19 23:30:00 -0.177
4 2020-06-19 23:45:00 -0.924
5 2020-06-20 00:00:00 -0.821
6 2020-06-20 23:30:00 -0.936
7 2020-06-20 23:45:00 -0.209

There seems to be an issue with the shift between days but I'm not sure how to solve it and I haven't been able to find similar questions. Is there a way to achieve this (elegantly)?

Upvotes: 5

Views: 886

Answers (2)

nachti

Reputation: 1100

ts %>%
  filter(between(time, t - days(mv), t)) %>%
  mutate(aux = as.numeric(time) %% (60 * 60 * 24)) %>%
  filter(between(aux,
                 (as.numeric(t) %% (60 * 60 * 24) - 900),
                 (as.numeric(t) %% (60 * 60 * 24) + 900)) |
           aux == 0) %>%
  select(-aux)

gives

   # # A tibble: 7 x 2
   #   time                  data
   #   <dttm>               <dbl>
   # 1 2020-06-18 23:45:00  0.543
   # 2 2020-06-19 00:00:00 -0.413
   # 3 2020-06-19 23:30:00 -0.177
   # 4 2020-06-19 23:45:00 -0.924
   # 5 2020-06-20 00:00:00 -0.821
   # 6 2020-06-20 23:30:00 -0.936
   # 7 2020-06-20 23:45:00 -0.209

This is probably very specific to this task and a bit hard to read. The interval reflects a duration (a fixed number of seconds). For similar cases, where the window crosses into the next date, you need to change the offsets and adjust the values by 86400 (the number of seconds in a day). This version doesn't work if t is at midnight, nor if the offset is not 15 minutes.
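The midnight wrap-around can also be handled without the aux == 0 special case by measuring the distance between times of day on a 24-hour circle. The following is a minimal base R sketch, not something from the question: within_daily_window, ref, times and half_width are names made up for illustration, and it assumes fixed 86400-second days (fine for UTC, possibly not across daylight-saving changes).

```r
# Hypothetical helper: TRUE where the time of day of `time` lies within
# `half_width` seconds of the time of day of `ref`, wrapping around midnight.
within_daily_window <- function(time, ref, half_width = 15 * 60) {
  day <- 24 * 60 * 60
  # Offset between the two timestamps, folded into [0, day)
  offset <- (as.numeric(time) - as.numeric(ref)) %% day
  # Distance on the 24-hour circle: forwards or backwards, whichever is shorter
  pmin(offset, day - offset) <= half_width
}

ref   <- as.POSIXct("2020-06-20 23:45", tz = "UTC")
mv    <- 2  # moving window in days, as in the question
times <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
             as.POSIXct("2020-12-31 23:45", tz = "UTC"), by = "15 mins")

# Same date range as in the question: mv days back, up to and including ref
in_range <- times >= ref - mv * 24 * 60 * 60 & times <= ref
res <- times[in_range & within_daily_window(times, ref)]
format(res, "%Y-%m-%d %H:%M")
```

Because the helper returns a logical vector, it should also slot into the pipeline, e.g. ts %>% filter(between(time, t - days(mv), t), within_daily_window(time, t)), and the margin becomes a single argument instead of hard-coded offsets.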

If you have just 2 days, this would also be an approach (using periods instead of durations):

ts %>%
  filter(between(time, t - days(mv), t)) %>%
  filter(between(time, t - minutes(15), t + minutes(15)) |
         between(time, t - days(1) - minutes(15), t - days(1) + minutes(15)) | 
         between(time, t - days(2) - minutes(15), t - days(2) + minutes(15)))

which gives the same result in this case. If you want to adjust the margins, you need to change the values.
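If the number of days changes, writing one between() per day gets tedious. As a rough sketch of the same idea for any mv (my own generalisation, written in base R; centres, near_centre, ref, times and half_width are illustrative names), one window centre per day can be generated and each timestamp checked against all of them:

```r
ref        <- as.POSIXct("2020-06-20 23:45", tz = "UTC")
mv         <- 2        # days to look back, as in the question
half_width <- 15 * 60  # margin in seconds on either side of each centre

times <- seq(as.POSIXct("2020-06-01 00:00", tz = "UTC"),
             as.POSIXct("2020-06-30 23:45", tz = "UTC"), by = "15 mins")

# One window centre per day: ref, ref - 1 day, ..., ref - mv days.
# "DSTday" steps by calendar day (same clock time), similar to a period.
centres <- seq(ref, by = "-1 DSTday", length.out = mv + 1)

# Logical matrix (rows = timestamps, columns = centres); a row is kept
# if the timestamp is within half_width seconds of at least one centre.
near_centre <- rowSums(outer(as.numeric(times), as.numeric(centres),
                             function(a, b) abs(a - b) <= half_width)) > 0

# Restrict to the overall range, as in the first filter() above
res <- times[near_centre & times >= min(centres) & times <= ref]
format(res, "%Y-%m-%d %H:%M")
```

Changing mv or half_width is then enough to adjust the number of days or the margins.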

By the way: you should NOT use t as the name of an object in R, because it is already the name of a function (the matrix transpose).

HTH

Upvotes: 1

Serkan

Reputation: 1945

It appears that strftime(ts$time[1], format = "%H:%M", tz = "UTC") > strftime(t, format = "%H:%M", tz = "UTC") evaluates to FALSE, which makes sense: the HH:MM values are compared as text, so "00:00" sorts before "23:45" regardless of the date it belongs to.

To mitigate this you need the full YYYY-MM-DD HH:MM so that the comparison is evaluated 'correctly', which is the case if you compare the full timestamp instead of only the time of day.
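A small illustration of the difference, using two made-up timestamps (a, b and the helper hm are hypothetical names, not from the question):

```r
a <- as.POSIXct("2020-06-21 00:00", tz = "UTC")  # midnight, 15 minutes after b
b <- as.POSIXct("2020-06-20 23:45", tz = "UTC")

# Time of day as an "HH:MM" string, as in the question's filter()
hm <- function(x) strftime(x, format = "%H:%M", tz = "UTC")

string_cmp   <- hm(a) > hm(b)  # compares "00:00" with "23:45" as text
datetime_cmp <- a > b          # compares the full timestamps

string_cmp    # FALSE
datetime_cmp  # TRUE
```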

We can get the intervals by adding a dummy-variable we call time_ that includes all the HH:MM, and then treat them as strings,

library(stringr) # provides str_detect(), in addition to dplyr and lubridate

# Troublesome timestamp
t <- ymd_hm("2020-06-20 23:45", tz="UTC")

ts %>% filter(
        between(
                time, 
                left = t - mv*24*60*60 -15*60,
                right = t
        )
) %>% mutate(
        time_ = strftime(time, format = "%H:%M", tz = "UTC") %>% as.character()
) %>% filter(
        str_detect(
                time_,
                pattern = seq(
                        t-15*60,
                        t+15*60,
                        by = "15 mins"
                ) %>% strftime(format = "%H:%M", tz = "UTC") %>% paste(
                        collapse = "|"
                )
        )
)

Which gives the output,

# A tibble: 8 x 3
  time                  data time_
  <dttm>               <dbl> <chr>
1 2020-06-18 23:30:00  1.00  23:30
2 2020-06-18 23:45:00  0.543 23:45
3 2020-06-19 00:00:00 -0.413 00:00
4 2020-06-19 23:30:00 -0.177 23:30
5 2020-06-19 23:45:00 -0.924 23:45
6 2020-06-20 00:00:00 -0.821 00:00
7 2020-06-20 23:30:00 -0.936 23:30
8 2020-06-20 23:45:00 -0.209 23:45

Upvotes: 1
