Reputation: 815
I have time series dataset with values measured at various frequencies at different sampling locations ('site_no'). I would like to filter down this dataset to remove lots of samples taken in quick succession - within 15-minutes in my case. Here is a simplified example:
library(lubridate)
set.seed(42)
n_sites <- 5
n_rows <- 100
df <- data.frame(
Date_time = ymd_hms("2013-01-01 10:17:00", tz = "GMT") + minutes(0:(n_sites * n_rows - 1) * 2),
site_no = as.character(rep(1:n_sites, each = n_rows)),
Value = rnorm(n_sites * n_rows))
df2 <- data.frame(Date_time = rep(ymd_hms("2013-01-02 05:00:00", tz = "GMT"),times=5),
site_no = as.character(c(1:5)),
Value = c(10,10,10,10,10))
df <- rbind(df,df2)
df <- df[order(df$site_no,df$Date_time),]
What I would like to do, for each site number ('site_no'), is to output a new data frame based on:
So for example, for site_no '1', the first time step is at 10:17am. I would then like to remove the time values between 10:19-10:29am (rows 2-7) and keep row 8 which has a 'date_time' time stamp of 10:31am. This is because this value is the maximum time difference from 10:17am within a 15-minute window. From 10:31am (row 8), I would then like to remove rows 9-14 (10:33-10:43am) and select row 15 that has a timestamp of 10:45am - 14-minutes after 10:31am (the max time differences within a 15-min window).
Lastly, if the time difference between the row versus the preceding row is >15-minutes, I would like to keep both of these. So in the example, I would like to keep the last row per site_no at 5:00am.
If its possible to achieve this in a way that reduces data processing power (i.e., vectorized approaches rather than explicit loops) that would be great as I have a very large dataset.
Many thanks in advance.
Upvotes: 3
Views: 128
Reputation: 84
An alternative function to run with nest/purrr:
filterDate <- function(df) {
t <- df %>% pull(Date_time)
i <- 1
p <- c(i)
m <- length(t)
while(i < m) {
j <- 0
d <- as.numeric(t[seq(i+1,length(t))] - t[i], units = "mins")
if (any(d <= 15 & d > 0)) {
i <- max(which(d <= 15 & d > 0)) + i
} else {
i <- min(which(d > 0)) + i
}
p <- c(p,i)
}
df.filter <- df[p,]
return(df.filter)
}
nest/purrr run:
df %>% nest(d=-c(site_no)) %>% mutate(o=purrr::map(d,filterDate)) %>% unnest(o) %>%
transmute(Date_time,site_no,Value) %>% as.data.frame()
Benchmark result similar to dplyr algorithm:
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 dplyr 22.8ms 25.9ms 39.9 586.96KB 4.70 17 2 426ms <df [85 × 3]> <Rprofmem [1,139 × 3]> <bench_tm [19]> <tibble [19 × 3]>
2 data.table 19.4ms 19.8ms 50.3 2.06MB 7.54 20 3 398ms <df [85 × 3]> <Rprofmem [1,361 × 3]> <bench_tm [23]> <tibble [23 × 3]>
3 baseR 13.4ms 13.8ms 70.0 789.2KB 10.0 28 4 400ms <df [85 × 3]> <Rprofmem [1,578 × 3]> <bench_tm [32]> <tibble [32 × 3]>
4 new 26.1ms 26.4ms 37.8 482.6KB 4.73 16 2 423ms <df [85 × 3]> <Rprofmem [1,088 × 3]> <bench_tm [18]> <tibble [18 × 3]>
>
Upvotes: 1
Reputation: 160407
I don't know that you can do it without a loop. Here's a simple function that loops as efficiently as it can, bounding by dates found. The worst-case is when all diff
s are over 15 minutes, in which case this iterates over every value in the vector.
Notes:
Whenever I have a while
loop and I'm not always 100% it has a clear exit strategy, I put in an iteration limit to prevent an infinite loop. I did it here using maxiters=length(tm)
, which means it will never loop more times than there are values in the input vector. It is likely not strictly necessary, but I have bitten myself too many times with "clearly it won't go infinite" (and a subsequent "oops") to not do it here, at least in dev.
The data must be pre-sorted by date within each site_no
group.
The site_no
grouping must be handled externally to the function.
The function:
fun <- function(tm, mins = 15, maxiters = length(tm), debug = TRUE) {
out <- replace(tm, -1, tm[1][NA])
lastused <- which.max(!is.na(out))
iter <- 0
while (iter < maxiters) {
if (lastused >= length(tm)) break
iter <- iter + 1
diffs <- as.numeric(tm[-(1:lastused)] - tm[lastused], units = "mins")
if (any(found <- (diffs <= mins)) ) {
nextused <- sum(found)
out[(lastused+1):(lastused+nextused-1)] <- tm[lastused]
out[lastused + nextused] <- tm[lastused + nextused]
lastused <- lastused + nextused
} else {
out[lastused + 1] <- tm[lastused + 1]
lastused <- lastused + 1
}
}
if (debug) message("# took ", iter, " iterations")
out
}
library(dplyr)
df %>%
mutate(prevtime = fun(Date_time), .by = site_no) %>%
slice_head(n = 1, by = c("site_no", "prevtime"))
# # took 16 iterations
# # took 16 iterations
# # took 16 iterations
# # took 16 iterations
# # took 16 iterations
# Date_time site_no Value prevtime
# 1 2013-01-01 10:17:00 1 1.370958447 2013-01-01 10:17:00
# 2 2013-01-01 10:31:00 1 -0.094659038 2013-01-01 10:31:00
# 3 2013-01-01 10:45:00 1 -0.133321336 2013-01-01 10:45:00
# 4 2013-01-01 10:59:00 1 -1.781308434 2013-01-01 10:59:00
# 5 2013-01-01 11:13:00 1 0.460097355 2013-01-01 11:13:00
# 6 2013-01-01 11:27:00 1 -1.717008679 2013-01-01 11:27:00
# 7 2013-01-01 11:41:00 1 0.758163236 2013-01-01 11:41:00
# 8 2013-01-01 11:55:00 1 0.655647883 2013-01-01 11:55:00
# 9 2013-01-01 12:09:00 1 0.679288816 2013-01-01 12:09:00
# 10 2013-01-01 12:23:00 1 1.399736827 2013-01-01 12:23:00
# 11 2013-01-01 12:37:00 1 -1.043118939 2013-01-01 12:37:00
# 12 2013-01-01 12:51:00 1 0.463767589 2013-01-01 12:51:00
# 13 2013-01-01 13:05:00 1 -1.194328895 2013-01-01 13:05:00
# 14 2013-01-01 13:19:00 1 -0.476173923 2013-01-01 13:19:00
# 15 2013-01-01 13:33:00 1 0.079982553 2013-01-01 13:33:00
# 16 2013-01-01 13:35:00 1 0.653204340 2013-01-01 13:35:00
# 17 2013-01-02 05:00:00 1 10.000000000 2013-01-02 05:00:00
# 18 2013-01-01 13:37:00 2 1.200965376 2013-01-01 13:37:00
# 19 2013-01-01 13:51:00 2 -0.122350172 2013-01-01 13:51:00
# 20 2013-01-01 14:05:00 2 -1.661099080 2013-01-01 14:05:00
# 21 2013-01-01 14:19:00 2 -1.470435741 2013-01-01 14:19:00
# 22 2013-01-01 14:33:00 2 -1.224747950 2013-01-01 14:33:00
# 23 2013-01-01 14:47:00 2 -1.097113768 2013-01-01 14:47:00
# 24 2013-01-01 15:01:00 2 -0.444684005 2013-01-01 15:01:00
# 25 2013-01-01 15:15:00 2 -1.056368413 2013-01-01 15:15:00
# 26 2013-01-01 15:29:00 2 -0.007762034 2013-01-01 15:29:00
# 27 2013-01-01 15:43:00 2 -0.362738416 2013-01-01 15:43:00
# 28 2013-01-01 15:57:00 2 -0.229778139 2013-01-01 15:57:00
# 29 2013-01-01 16:11:00 2 0.643008700 2013-01-01 16:11:00
# 30 2013-01-01 16:25:00 2 -0.279259373 2013-01-01 16:25:00
# 31 2013-01-01 16:39:00 2 -0.345087978 2013-01-01 16:39:00
# 32 2013-01-01 16:53:00 2 1.815228446 2013-01-01 16:53:00
# 33 2013-01-01 16:55:00 2 0.128821429 2013-01-01 16:55:00
# 34 2013-01-02 05:00:00 2 10.000000000 2013-01-02 05:00:00
# 35 2013-01-01 16:57:00 3 -2.000929238 2013-01-01 16:57:00
# 36 2013-01-01 17:11:00 3 -1.054055782 2013-01-01 17:11:00
# 37 2013-01-01 17:25:00 3 0.495619642 2013-01-01 17:25:00
# 38 2013-01-01 17:39:00 3 -0.351512874 2013-01-01 17:39:00
# 39 2013-01-01 17:53:00 3 -0.658503426 2013-01-01 17:53:00
# 40 2013-01-01 18:07:00 3 -0.390965408 2013-01-01 18:07:00
# 41 2013-01-01 18:21:00 3 1.258481665 2013-01-01 18:21:00
# 42 2013-01-01 18:35:00 3 -0.971385229 2013-01-01 18:35:00
# 43 2013-01-01 18:49:00 3 -0.738440754 2013-01-01 18:49:00
# 44 2013-01-01 19:03:00 3 -1.851555663 2013-01-01 19:03:00
# 45 2013-01-01 19:17:00 3 0.573751697 2013-01-01 19:17:00
# 46 2013-01-01 19:31:00 3 -1.242670271 2013-01-01 19:31:00
# 47 2013-01-01 19:45:00 3 0.043722008 2013-01-01 19:45:00
# 48 2013-01-01 19:59:00 3 0.446041053 2013-01-01 19:59:00
# 49 2013-01-01 20:13:00 3 0.097340485 2013-01-01 20:13:00
# 50 2013-01-01 20:15:00 3 -1.625616739 2013-01-01 20:15:00
# 51 2013-01-02 05:00:00 3 10.000000000 2013-01-02 05:00:00
# 52 2013-01-01 20:17:00 4 -0.004620768 2013-01-01 20:17:00
# 53 2013-01-01 20:31:00 4 0.992943637 2013-01-01 20:31:00
# 54 2013-01-01 20:45:00 4 0.586807720 2013-01-01 20:45:00
# 55 2013-01-01 20:59:00 4 0.189128812 2013-01-01 20:59:00
# 56 2013-01-01 21:13:00 4 -0.835205805 2013-01-01 21:13:00
# 57 2013-01-01 21:27:00 4 -0.073458335 2013-01-01 21:27:00
# 58 2013-01-01 21:41:00 4 -0.434617039 2013-01-01 21:41:00
# 59 2013-01-01 21:55:00 4 1.353361894 2013-01-01 21:55:00
# 60 2013-01-01 22:09:00 4 -0.290145312 2013-01-01 22:09:00
# 61 2013-01-01 22:23:00 4 -0.336311209 2013-01-01 22:23:00
# 62 2013-01-01 22:37:00 4 1.628442266 2013-01-01 22:37:00
# 63 2013-01-01 22:51:00 4 -1.109418760 2013-01-01 22:51:00
# 64 2013-01-01 23:05:00 4 -0.195656817 2013-01-01 23:05:00
# 65 2013-01-01 23:19:00 4 -0.301869926 2013-01-01 23:19:00
# 66 2013-01-01 23:33:00 4 -0.255607655 2013-01-01 23:33:00
# 67 2013-01-01 23:35:00 4 0.931032901 2013-01-01 23:35:00
# 68 2013-01-02 05:00:00 4 10.000000000 2013-01-02 05:00:00
# 69 2013-01-01 23:37:00 5 1.334912585 2013-01-01 23:37:00
# 70 2013-01-01 23:51:00 5 0.655511883 2013-01-01 23:51:00
# 71 2013-01-02 00:05:00 5 -0.777351759 2013-01-02 00:05:00
# 72 2013-01-02 00:19:00 5 -1.453529565 2013-01-02 00:19:00
# 73 2013-01-02 00:33:00 5 0.152608159 2013-01-02 00:33:00
# 74 2013-01-02 00:47:00 5 0.890356305 2013-01-02 00:47:00
# 75 2013-01-02 01:01:00 5 1.429338080 2013-01-02 01:01:00
# 76 2013-01-02 01:15:00 5 0.546115158 2013-01-02 01:15:00
# 77 2013-01-02 01:29:00 5 1.618343936 2013-01-02 01:29:00
# 78 2013-01-02 01:43:00 5 -1.083075142 2013-01-02 01:43:00
# 79 2013-01-02 01:57:00 5 -0.009056475 2013-01-02 01:57:00
# 80 2013-01-02 02:11:00 5 -0.283647452 2013-01-02 02:11:00
# 81 2013-01-02 02:25:00 5 0.761863447 2013-01-02 02:25:00
# 82 2013-01-02 02:39:00 5 -0.115135986 2013-01-02 02:39:00
# 83 2013-01-02 02:53:00 5 0.121258850 2013-01-02 02:53:00
# 84 2013-01-02 02:55:00 5 -0.011221686 2013-01-02 02:55:00
# 85 2013-01-02 05:00:00 5 10.000000000 2013-01-02 05:00:00
library(data.table)
as.data.table(df)[, prevtime := fun(Date_time), by = .(site_no)
][, .SD[1,], by = .(site_no, prevtime)
][, prevtime := NULL]
(The columns are in a different order, otherwise identical to the dplyr method above.)
A bit more work, but it produces the same results as dplyr and data.table above.
split(df, df$site_no) |>
lapply(function(site) {
transform(site, prevtime = fun(Date_time, debug=F)) |>
transform(grp = cumsum(c(TRUE, prevtime[-1] != prevtime[-length(prevtime)]))) |>
subset(ave(grp, grp, FUN = seq_along) == 1)
}) |>
do.call(rbind.data.frame, args = _) |>
subset(select = -c(prevtime, grp))
All three produce the same output, albeit with minor caveats: the data.table
method reorders columns and a different class object, and the base-R solution preserves the original row names. Both of those are cosmetic, but for the sake of benchmarking I'll fix those changes so that bench::mark(.)
will confirm that all outputs are the same.
bench::mark(
dplyr = {
df %>%
mutate(prevtime = fun(Date_time, debug=F), .by = site_no) %>%
slice_head(n = 1, by = c("site_no", "prevtime")) %>%
select(-prevtime)
},
data.table = {
as.data.table(df)[, prevtime := fun(Date_time, debug=F), by = .(site_no)
][, .SD[1,], by = .(site_no, prevtime)
][, prevtime := NULL] |>
# data.table is reordering columns above, aesthetic fix only for bench::mark
setcolorder(names(df)) |>
as.data.frame()
},
baseR = {
split(df, df$site_no) |>
lapply(function(site) {
transform(site, prevtime = fun(Date_time, debug=F)) |>
transform(grp = cumsum(c(TRUE, prevtime[-1] != prevtime[-length(prevtime)]))) |>
subset(ave(grp, grp, FUN = seq_along) == 1)
}) |>
do.call(rbind.data.frame, args = _) |>
subset(select = -c(prevtime, grp)) |>
# the original row names are preserved, aesthetic fix only for bench::mark
`rownames<-`(NULL)
}
)
# # A tibble: 3 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 dplyr 11ms 11.32ms 85.0 NA 6.07 28 2 329ms <df [85 × 3]> <NULL> <bench_tm [30]> <tibble [30 × 3]>
# 2 data.table 10.65ms 11.13ms 81.9 NA 2.56 32 1 391ms <df [85 × 3]> <NULL> <bench_tm [33]> <tibble [33 × 3]>
# 3 baseR 6.98ms 7.45ms 130. NA 2.66 49 1 376ms <df [85 × 3]> <NULL> <bench_tm [50]> <tibble [50 × 3]>
I admit I'm a little surprised that the base-R was fastest (and data.table
the slowest!) among the three, but with larger data this may not always be the case.
Upvotes: 2