Reputation: 77
The problem I have is as follows:
How can we check what values were observed at a certain point in time across all processes? How can we produce periodic aggregates (e.g. a mean daily value) based on known information?
Here is a mockup of the data:
tsVector1 <- c("2020-01-01 16:18", "2020-01-03 12:12", "2020-01-03 14:56", "2020-01-10 09:48", "2020-01-11 23:00")
tsVector2 <- c("2020-01-01 16:18", "2020-01-02 14:10", "2020-01-04 18:36", "2020-01-06 19:23", "2020-01-10 21:10")
tsVector3 <- c("2020-01-01 16:18", "2020-01-05 13:37", "2020-01-08 01:11")
obsVector1 <- c(1, 1.2, 1.6, 3, 3.8)
obsVector2 <- c(1, 1.4, 1.7, 1.8, 2.1)
obsVector3 <- c(1:3)
procVector1 <- rep("proc1", 5)
procVector2 <- rep("proc2", 5)
procVector3 <- rep("proc3", 3)
tsDF <- data.frame(ts = as.POSIXct(c(tsVector1, tsVector2, tsVector3)),
observation = c(obsVector1, obsVector2, obsVector3),
process = c(procVector1, procVector2, procVector3))
I am looking for an output like this:
SomeHelpfulFunction(tsDF, "2020-01-07 01:00")
> process observation
> proc1 1.6
> proc2 1.8
> proc3 2.0
Is there a way to get this data without resorting to loops or replication of all unique timestamps across all of the time series being compared?
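For reference, this is the kind of loop-based approach I am hoping to avoid (a sketch only; the function name just mirrors the mockup above):

# Loop-based baseline (a sketch of what the question wants to avoid)
SomeHelpfulFunction <- function(tsDF, time) {
  result <- data.frame()
  for (p in unique(tsDF$process)) {
    # Observations for this process at or before the query time
    sub <- tsDF[tsDF$process == p & tsDF$ts <= as.POSIXct(time), ]
    # Keep the most recent one
    result <- rbind(result, sub[which.max(sub$ts), c("process", "observation")])
  }
  result
}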
UPDATE:
Having had a look at the replies and tested them with a bigger dataset, I found that the data.table method works best for my use case and also gives the fastest response times. The approaches I timed (roughly as sketched below) were:
- data.table (rolling join)
- base R with duplicated()
- dplyr, with top_n() instead of slice_tail()
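The comparison can be timed along these lines (a minimal sketch; dataTableFn, duplicatedFn and dplyrFn are hypothetical wrappers around the solutions in the answers below, and the measured dataset was larger than the mockup):

library(microbenchmark)

# Hypothetical wrappers around the three answers; names are illustrative only
microbenchmark(
  data.table = dataTableFn(tsDF, "2020-01-07 01:00"),
  duplicated = duplicatedFn(tsDF, "2020-01-07 01:00"),
  dplyr      = dplyrFn(tsDF, "2020-01-07 01:00"),
  times = 100
)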
Upvotes: 0
Views: 571
Reputation: 77
I have found an additional way of achieving the result with base R alone. It relies on duplicated(), though, and only retrieves the last reading, so it will not work for applications where the last N readings are required (see the sketch after the function):
SomeHelpfulFunction <- function(tsDF, time) {
  # Sort so the most recent observation comes last within each process
  functionLocalDF <- tsDF[order(tsDF$process, tsDF$ts), ]
  # Keep only observations at or before the query time
  functionLocalDF <- functionLocalDF[functionLocalDF$ts <= time, ]
  # fromLast = TRUE flags everything except the final row per process
  functionLocalDF[!duplicated(functionLocalDF$process, fromLast = TRUE), ]
}
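For the last-N case, a hedged base R sketch using split() and tail() (the helper name lastN is hypothetical):

# Returns the last n readings per process at or before `time` (sketch only)
lastN <- function(tsDF, time, n = 2) {
  df <- tsDF[tsDF$ts <= as.POSIXct(time), ]
  df <- df[order(df$process, df$ts), ]
  do.call(rbind, lapply(split(df, df$process), tail, n = n))
}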
Upvotes: 0
Reputation: 17725
Here's a cute solution using a rolling join in data.table:
library(data.table)

SomeHelpfulFunction <- function(tsDF, time) {
  # Observations, keyed by process and timestamp
  a = data.table(tsDF, key = c("process", "ts"))
  # One lookup row per process at the query time
  b = data.table(process = unique(tsDF$process),
                 ts = as.POSIXct(time),
                 key = c("process", "ts"))
  # roll = TRUE carries the last observation at or before each lookup time forward
  a[b, roll = TRUE]
}
SomeHelpfulFunction(tsDF, "2020-01-07 01:00")
#> ts observation process
#> 1: 2020-01-07 01:00:00 1.6 proc1
#> 2: 2020-01-07 01:00:00 1.8 proc2
#> 3: 2020-01-07 01:00:00 2.0 proc3
I found this blog post useful in explaining rolling joins: https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
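If the roll argument is new to you, a toy example (data invented purely for illustration) shows the difference it makes:

library(data.table)

# Keyed toy table; dt[.(7)] joins on the key column ts
dt <- data.table(ts = c(1, 5, 10), v = c("a", "b", "c"), key = "ts")
dt[.(7)]               # exact join: no row with ts == 7, so v is NA
dt[.(7), roll = TRUE]  # last observation rolled forward: v is "b"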
Upvotes: 1
Reputation: 4658
I think you want the last value before the time you are querying, for every process. With dplyr:
library(dplyr)

tsDF %>%
  group_by(process) %>%
  arrange(ts) %>%                       # oldest to newest within each process
  filter(ts <= "2020-01-07 01:00") %>%  # drop observations after the query time
  slice_tail() %>%                      # keep the most recent remaining row
  ungroup() %>%
  select(-ts)
which gives
# A tibble: 3 x 2
observation process
<dbl> <fct>
1 1.6 proc1
2 1.8 proc2
3 2 proc3
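If you are on an older dplyr release without slice_tail(), the question's update mentions using top_n() instead; a sketch of that variant (same cutoff time as above):

# top_n(1, ts) keeps the row with the largest ts in each group
tsDF %>%
  filter(ts <= "2020-01-07 01:00") %>%
  group_by(process) %>%
  top_n(1, ts) %>%
  ungroup() %>%
  select(-ts)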
Upvotes: 1