Filter out values within certain time differences within inconsistent time series dataset

Question

I have a time series dataset with values measured at various frequencies at different sampling locations ('site_no'). I would like to filter this dataset to remove samples taken in quick succession, within 15 minutes of each other in my case. Here is a simplified example:

library(lubridate)
set.seed(42)
n_sites <- 5
n_rows <- 100
df <- data.frame(
  Date_time = ymd_hms("2013-01-01 10:17:00", tz = "GMT") + minutes(0:(n_sites * n_rows - 1) * 2),
  site_no = as.character(rep(1:n_sites, each = n_rows)),
  Value = rnorm(n_sites * n_rows))
df2 <- data.frame(Date_time = rep(ymd_hms("2013-01-02 05:00:00", tz = "GMT"), times = 5),
                  site_no = as.character(c(1:5)),
                  Value = c(10, 10, 10, 10, 10))
df <- rbind(df, df2)
df <- df[order(df$site_no,df$Date_time),]

What I would like to do, for each site number ('site_no'), is to output a new data frame based on:

selecting the first row (earliest date/time) of each site_no;
searching up to 15 minutes into the future from that row;
identifying the later row with the largest time difference that is still less than or equal to 15 minutes;
removing any rows that fall between those two rows;
repeating this process starting from the newly selected row.

So for example, for site_no '1', the first time step is at 10:17am. I would then like to remove the rows between 10:19 and 10:29am (rows 2-7) and keep row 8, which has a 'Date_time' timestamp of 10:31am. This is because 10:31am is the largest time difference from 10:17am that still falls within a 15-minute window. From 10:31am (row 8), I would then like to remove rows 9-14 (10:33-10:43am) and select row 15, which has a timestamp of 10:45am, 14 minutes after 10:31am (again the maximum time difference within a 15-minute window).

Lastly, if the time difference between a row and the preceding row is more than 15 minutes, I would like to keep both rows. So in the example, I would like to keep the last row per site_no at 5:00am.

If it's possible to achieve this in a way that reduces processing time (i.e., vectorized approaches rather than explicit loops), that would be great, as I have a very large dataset.

Many thanks in advance.

r2evans · Accepted Answer

I don't know that you can do it without a loop. Here's a simple function that loops as efficiently as it can, jumping from each kept timestamp straight to the next one it keeps. The worst case is when all diffs are over 15 minutes, in which case this iterates over every value in the vector.

Notes:

Whenever I have a while loop and I'm not 100% sure it has a clear exit strategy, I put in an iteration limit to prevent an infinite loop. I did it here using maxiters=length(tm), which means it will never loop more times than there are values in the input vector. It is likely not strictly necessary, but I have been bitten too many times by "clearly it won't go infinite" (and a subsequent "oops") to not do it here, at least in dev.
The data must be pre-sorted by date within each site_no group.
The site_no grouping must be handled externally to the function.

The function:


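fun <- function(tm, mins = 15, maxiters = length(tm), debug = TRUE) {
  # start with every element except the first set to NA (same class as tm)
  out <- replace(tm, -1, tm[1][NA])
  # index of the most recently kept timestamp (initially the first element)
  lastused <- which.max(!is.na(out))
  iter <- 0
  while (iter < maxiters) {
    if (lastused >= length(tm)) break
    iter <- iter + 1
    # minutes from the last kept timestamp to each later one
    diffs <- as.numeric(tm[-(1:lastused)] - tm[lastused], units = "mins")
    if (any(found <- (diffs <= mins)) ) {
      # tm is sorted, so the count of in-window values is the offset of the furthest one
      nextused <- sum(found)
      # label the skipped rows with the timestamp of the kept row before them;
      # when nextused is 1 this range runs backwards, which is harmless here
      out[(lastused+1):(lastused+nextused-1)] <- tm[lastused]
      out[lastused + nextused] <- tm[lastused + nextused]
      lastused <- lastused + nextused
    } else {
      # nothing falls within the window: keep the very next row
      out[lastused + 1] <- tm[lastused + 1]
      lastused <- lastused + 1
    }
  }
  if (debug) message("# took ", iter, " iterations")
  out
}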
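
For comparison, here is a minimal sketch of the same greedy rule written to return a logical keep-vector rather than a grouping column. It is not part of the function above: keep_spaced is a hypothetical name, and the sketch assumes the timestamps are POSIXct, sorted ascending, and free of NA values. It still loops once per kept row, but uses base R's findInterval() to jump straight to the last timestamp inside each window.

```r
# Hypothetical helper (not from the answer): TRUE marks the rows to keep.
# Assumes `tm` is POSIXct (or seconds), sorted ascending, with no NA values.
keep_spaced <- function(tm, mins = 15) {
  secs <- as.numeric(tm)              # POSIXct -> seconds since the epoch
  n <- length(secs)
  keep <- logical(n)
  if (n == 0L) return(keep)
  i <- 1L
  repeat {
    keep[i] <- TRUE
    if (i >= n) break
    # index of the last timestamp no more than `mins` after the current one
    j <- findInterval(secs[i] + mins * 60, secs)
    # if no later row falls inside the window, step to the very next row
    i <- if (j > i) j else i + 1L
  }
  keep
}

# toy timestamps at 0, 2, 4, 14, 16, 28 and 100 minutes after 10:17
tm <- as.POSIXct("2013-01-01 10:17:00", tz = "GMT") + 60 * c(0, 2, 4, 14, 16, 28, 100)
which(keep_spaced(tm))
```

With a recent dplyr (one that supports .by in filter()), this could presumably be applied per site with something like df %>% filter(keep_spaced(Date_time), .by = site_no), though that usage has not been benchmarked here.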

dplyr

library(dplyr)
df %>%
  mutate(prevtime = fun(Date_time), .by = site_no) %>%
  slice_head(n = 1, by = c("site_no", "prevtime"))
# # took 16 iterations
# # took 16 iterations
# # took 16 iterations
# # took 16 iterations
# # took 16 iterations
#              Date_time site_no        Value            prevtime
# 1  2013-01-01 10:17:00       1  1.370958447 2013-01-01 10:17:00
# 2  2013-01-01 10:31:00       1 -0.094659038 2013-01-01 10:31:00
# 3  2013-01-01 10:45:00       1 -0.133321336 2013-01-01 10:45:00
# 4  2013-01-01 10:59:00       1 -1.781308434 2013-01-01 10:59:00
# 5  2013-01-01 11:13:00       1  0.460097355 2013-01-01 11:13:00
# 6  2013-01-01 11:27:00       1 -1.717008679 2013-01-01 11:27:00
# 7  2013-01-01 11:41:00       1  0.758163236 2013-01-01 11:41:00
# 8  2013-01-01 11:55:00       1  0.655647883 2013-01-01 11:55:00
# 9  2013-01-01 12:09:00       1  0.679288816 2013-01-01 12:09:00
# 10 2013-01-01 12:23:00       1  1.399736827 2013-01-01 12:23:00
# 11 2013-01-01 12:37:00       1 -1.043118939 2013-01-01 12:37:00
# 12 2013-01-01 12:51:00       1  0.463767589 2013-01-01 12:51:00
# 13 2013-01-01 13:05:00       1 -1.194328895 2013-01-01 13:05:00
# 14 2013-01-01 13:19:00       1 -0.476173923 2013-01-01 13:19:00
# 15 2013-01-01 13:33:00       1  0.079982553 2013-01-01 13:33:00
# 16 2013-01-01 13:35:00       1  0.653204340 2013-01-01 13:35:00
# 17 2013-01-02 05:00:00       1 10.000000000 2013-01-02 05:00:00
# 18 2013-01-01 13:37:00       2  1.200965376 2013-01-01 13:37:00
# 19 2013-01-01 13:51:00       2 -0.122350172 2013-01-01 13:51:00
# 20 2013-01-01 14:05:00       2 -1.661099080 2013-01-01 14:05:00
# 21 2013-01-01 14:19:00       2 -1.470435741 2013-01-01 14:19:00
# 22 2013-01-01 14:33:00       2 -1.224747950 2013-01-01 14:33:00
# 23 2013-01-01 14:47:00       2 -1.097113768 2013-01-01 14:47:00
# 24 2013-01-01 15:01:00       2 -0.444684005 2013-01-01 15:01:00
# 25 2013-01-01 15:15:00       2 -1.056368413 2013-01-01 15:15:00
# 26 2013-01-01 15:29:00       2 -0.007762034 2013-01-01 15:29:00
# 27 2013-01-01 15:43:00       2 -0.362738416 2013-01-01 15:43:00
# 28 2013-01-01 15:57:00       2 -0.229778139 2013-01-01 15:57:00
# 29 2013-01-01 16:11:00       2  0.643008700 2013-01-01 16:11:00
# 30 2013-01-01 16:25:00       2 -0.279259373 2013-01-01 16:25:00
# 31 2013-01-01 16:39:00       2 -0.345087978 2013-01-01 16:39:00
# 32 2013-01-01 16:53:00       2  1.815228446 2013-01-01 16:53:00
# 33 2013-01-01 16:55:00       2  0.128821429 2013-01-01 16:55:00
# 34 2013-01-02 05:00:00       2 10.000000000 2013-01-02 05:00:00
# 35 2013-01-01 16:57:00       3 -2.000929238 2013-01-01 16:57:00
# 36 2013-01-01 17:11:00       3 -1.054055782 2013-01-01 17:11:00
# 37 2013-01-01 17:25:00       3  0.495619642 2013-01-01 17:25:00
# 38 2013-01-01 17:39:00       3 -0.351512874 2013-01-01 17:39:00
# 39 2013-01-01 17:53:00       3 -0.658503426 2013-01-01 17:53:00
# 40 2013-01-01 18:07:00       3 -0.390965408 2013-01-01 18:07:00
# 41 2013-01-01 18:21:00       3  1.258481665 2013-01-01 18:21:00
# 42 2013-01-01 18:35:00       3 -0.971385229 2013-01-01 18:35:00
# 43 2013-01-01 18:49:00       3 -0.738440754 2013-01-01 18:49:00
# 44 2013-01-01 19:03:00       3 -1.851555663 2013-01-01 19:03:00
# 45 2013-01-01 19:17:00       3  0.573751697 2013-01-01 19:17:00
# 46 2013-01-01 19:31:00       3 -1.242670271 2013-01-01 19:31:00
# 47 2013-01-01 19:45:00       3  0.043722008 2013-01-01 19:45:00
# 48 2013-01-01 19:59:00       3  0.446041053 2013-01-01 19:59:00
# 49 2013-01-01 20:13:00       3  0.097340485 2013-01-01 20:13:00
# 50 2013-01-01 20:15:00       3 -1.625616739 2013-01-01 20:15:00
# 51 2013-01-02 05:00:00       3 10.000000000 2013-01-02 05:00:00
# 52 2013-01-01 20:17:00       4 -0.004620768 2013-01-01 20:17:00
# 53 2013-01-01 20:31:00       4  0.992943637 2013-01-01 20:31:00
# 54 2013-01-01 20:45:00       4  0.586807720 2013-01-01 20:45:00
# 55 2013-01-01 20:59:00       4  0.189128812 2013-01-01 20:59:00
# 56 2013-01-01 21:13:00       4 -0.835205805 2013-01-01 21:13:00
# 57 2013-01-01 21:27:00       4 -0.073458335 2013-01-01 21:27:00
# 58 2013-01-01 21:41:00       4 -0.434617039 2013-01-01 21:41:00
# 59 2013-01-01 21:55:00       4  1.353361894 2013-01-01 21:55:00
# 60 2013-01-01 22:09:00       4 -0.290145312 2013-01-01 22:09:00
# 61 2013-01-01 22:23:00       4 -0.336311209 2013-01-01 22:23:00
# 62 2013-01-01 22:37:00       4  1.628442266 2013-01-01 22:37:00
# 63 2013-01-01 22:51:00       4 -1.109418760 2013-01-01 22:51:00
# 64 2013-01-01 23:05:00       4 -0.195656817 2013-01-01 23:05:00
# 65 2013-01-01 23:19:00       4 -0.301869926 2013-01-01 23:19:00
# 66 2013-01-01 23:33:00       4 -0.255607655 2013-01-01 23:33:00
# 67 2013-01-01 23:35:00       4  0.931032901 2013-01-01 23:35:00
# 68 2013-01-02 05:00:00       4 10.000000000 2013-01-02 05:00:00
# 69 2013-01-01 23:37:00       5  1.334912585 2013-01-01 23:37:00
# 70 2013-01-01 23:51:00       5  0.655511883 2013-01-01 23:51:00
# 71 2013-01-02 00:05:00       5 -0.777351759 2013-01-02 00:05:00
# 72 2013-01-02 00:19:00       5 -1.453529565 2013-01-02 00:19:00
# 73 2013-01-02 00:33:00       5  0.152608159 2013-01-02 00:33:00
# 74 2013-01-02 00:47:00       5  0.890356305 2013-01-02 00:47:00
# 75 2013-01-02 01:01:00       5  1.429338080 2013-01-02 01:01:00
# 76 2013-01-02 01:15:00       5  0.546115158 2013-01-02 01:15:00
# 77 2013-01-02 01:29:00       5  1.618343936 2013-01-02 01:29:00
# 78 2013-01-02 01:43:00       5 -1.083075142 2013-01-02 01:43:00
# 79 2013-01-02 01:57:00       5 -0.009056475 2013-01-02 01:57:00
# 80 2013-01-02 02:11:00       5 -0.283647452 2013-01-02 02:11:00
# 81 2013-01-02 02:25:00       5  0.761863447 2013-01-02 02:25:00
# 82 2013-01-02 02:39:00       5 -0.115135986 2013-01-02 02:39:00
# 83 2013-01-02 02:53:00       5  0.121258850 2013-01-02 02:53:00
# 84 2013-01-02 02:55:00       5 -0.011221686 2013-01-02 02:55:00
# 85 2013-01-02 05:00:00       5 10.000000000 2013-01-02 05:00:00

data.table

library(data.table)
as.data.table(df)[, prevtime := fun(Date_time), by = .(site_no)
                  ][, .SD[1,], by = .(site_no, prevtime)
                    ][, prevtime := NULL]

(The columns are in a different order, otherwise identical to the dplyr method above.)

base R

A bit more work, but it produces the same results as dplyr and data.table above.

split(df, df$site_no) |>
  lapply(function(site) {
    transform(site, prevtime = fun(Date_time, debug=F)) |>
      transform(grp = cumsum(c(TRUE, prevtime[-1] != prevtime[-length(prevtime)]))) |>
      subset(ave(grp, grp, FUN = seq_along) == 1)
  }) |>
  do.call(rbind.data.frame, args = _) |>
  subset(select = -c(prevtime, grp))
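
The grp step in the base R pipeline relies on a common run-ID idiom: compare each element with its predecessor, then take the cumulative sum of the "changed" flags. A small sketch on a toy vector (x is made up purely for illustration) shows what the two transform/subset steps are doing:

```r
# toy vector standing in for the prevtime column
x <- c("a", "a", "b", "b", "b", "a")

# TRUE wherever the value differs from the one before it (the first is always TRUE);
# the cumulative sum then numbers each run of identical consecutive values
grp <- cumsum(c(TRUE, x[-1] != x[-length(x)]))

# position within each run; position 1 marks the first row of the run
first_in_run <- ave(grp, grp, FUN = seq_along) == 1

x[first_in_run]
```

Unlike grouping on the value alone, this treats the trailing "a" as a new run, which is why the pipeline builds grp instead of subsetting on prevtime directly.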

Benchmark/Comparison

All three produce the same output, albeit with minor caveats: the data.table method reorders the columns and returns an object of a different class, and the base-R solution preserves the original row names. Both differences are cosmetic, but for the sake of benchmarking I'll fix them so that bench::mark(.) can confirm that all outputs are the same.

bench::mark(
  dplyr = {
    df %>%
      mutate(prevtime = fun(Date_time, debug=F), .by = site_no) %>%
      slice_head(n = 1, by = c("site_no", "prevtime")) %>%
      select(-prevtime)
  },
  data.table = {
    as.data.table(df)[, prevtime := fun(Date_time, debug=F), by = .(site_no)
                      ][, .SD[1,], by = .(site_no, prevtime)
                        ][, prevtime := NULL] |>
      # data.table is reordering columns above, aesthetic fix only for bench::mark
      setcolorder(names(df)) |>
      as.data.frame()
  },
  baseR = {
    split(df, df$site_no) |>
      lapply(function(site) {
        transform(site, prevtime = fun(Date_time, debug=F)) |>
          transform(grp = cumsum(c(TRUE, prevtime[-1] != prevtime[-length(prevtime)]))) |>
          subset(ave(grp, grp, FUN = seq_along) == 1)
      }) |>
      do.call(rbind.data.frame, args = _) |>
      subset(select = -c(prevtime, grp)) |>
      # the original row names are preserved, aesthetic fix only for bench::mark
      `rownames<-`(NULL)
  }
)

# # A tibble: 3 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
# 1 dplyr          11ms  11.32ms      85.0        NA     6.07    28     2      329ms
# 2 data.table  10.65ms  11.13ms      81.9        NA     2.56    32     1      391ms
# 3 baseR        6.98ms   7.45ms     130.         NA     2.66    49     1      376ms
# # (list columns result, memory, time, gc not shown)

I admit I'm a little surprised that base R was the fastest of the three (and data.table the slowest by itr/sec!), but with larger data this may not always be the case.
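
To see how the ranking holds up at scale, one option is to rerun the benchmark on a larger, more irregular data set. The generator below is only a sketch: make_irregular, its arguments, and the 1-to-30-minute gap range are all assumptions chosen for illustration, not part of the original question.

```r
# Hypothetical generator: irregularly spaced samples for several sites.
# All names and defaults here are illustrative assumptions.
make_irregular <- function(n_sites = 50, n_rows = 10000, seed = 1) {
  set.seed(seed)
  start <- as.POSIXct("2013-01-01 00:00:00", tz = "GMT")
  sites <- lapply(seq_len(n_sites), function(s) {
    # gaps of 1 to 30 minutes between consecutive samples, in seconds
    gaps <- sample(1:30, n_rows, replace = TRUE) * 60
    data.frame(
      Date_time = start + cumsum(gaps),   # strictly increasing within a site
      site_no = as.character(s),
      Value = rnorm(n_rows)
    )
  })
  do.call(rbind, sites)
}

# small instance for a quick check; raise n_sites / n_rows for real timing
big <- make_irregular(n_sites = 3, n_rows = 1000)
```

Because the gaps are all positive, each site's timestamps come out already sorted, which is what the function expects.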
