Reputation: 77
The problem I have is as follows:
How can we check what values were observed at a certain point in time across all processes? How can we produce periodic aggregates (e.g. a mean daily value) based on known information?
Here is a mockup of the data:
tsVector1 <- c("2020-01-01 16:18", "2020-01-03 12:12", "2020-01-03 14:56", "2020-01-10 09:48", "2020-01-11 23:00")
tsVector2 <- c("2020-01-01 16:18", "2020-01-02 14:10", "2020-01-04 18:36", "2020-01-06 19:23", "2020-01-10 21:10")
tsVector3 <- c("2020-01-01 16:18", "2020-01-05 13:37", "2020-01-08 01:11")
obsVector1 <- c(1, 1.2, 1.6, 3, 3.8)
obsVector2 <- c(1, 1.4, 1.7, 1.8, 2.1)
obsVector3 <- c(1:3)
procVector1 <- rep("proc1", 5)
procVector2 <- rep("proc2", 5)
procVector3 <- rep("proc3", 3)
tsDF <- data.frame(ts = as.POSIXct(c(tsVector1, tsVector2, tsVector3)),
observation = c(obsVector1, obsVector2, obsVector3),
process = c(procVector1, procVector2, procVector3))
I am looking for an output like this:
SomeHelpfulFunction(tsDF, "2020-01-07 01:00")
> process observation
> proc1 1.6
> proc2 1.8
> proc3 2.0
Is there a way to get this data without resorting to loops or replication of all unique timestamps across all of the time series being compared?
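For reference, this is the kind of loop-based approach I am hoping to avoid (a sketch only; the function name just mirrors the mockup above):

# Loop-based baseline (a sketch of what the question wants to avoid)
SomeHelpfulFunction <- function(tsDF, time) {
  result <- data.frame()
  for (p in unique(tsDF$process)) {
    # Observations for this process at or before the query time
    sub <- tsDF[tsDF$process == p & tsDF$ts <= as.POSIXct(time), ]
    # Keep the most recent one
    result <- rbind(result, sub[which.max(sub$ts), c("process", "observation")])
  }
  result
}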
UPDATE:
Having had a look at the replies and tested them with a bigger dataset, I found that the data.table method works best for my use case and also gives the fastest response times. The approaches I timed (roughly as sketched below) were:
- data.table (rolling join)
- base R with duplicated()
- dplyr, with top_n() instead of slice_tail()
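The comparison can be timed along these lines (a minimal sketch; dataTableFn, duplicatedFn and dplyrFn are hypothetical wrappers around the solutions in the answers below, and the measured dataset was larger than the mockup):

library(microbenchmark)

# Hypothetical wrappers around the three answers; names are illustrative only
microbenchmark(
  data.table = dataTableFn(tsDF, "2020-01-07 01:00"),
  duplicated = duplicatedFn(tsDF, "2020-01-07 01:00"),
  dplyr      = dplyrFn(tsDF, "2020-01-07 01:00"),
  times = 100
)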
Upvotes: 0
Views: 571
Reputation: 77
I have found an additional way of achieving the result with base R alone. It relies on duplicated(), though, and only retrieves the last reading, so it will not work for applications where the last N readings are required (see the sketch after the function):
SomeHelpfulFunction <- function(tsDF, time) {
  # Sort so the most recent observation comes last within each process
  functionLocalDF <- tsDF[order(tsDF$process, tsDF$ts), ]
  # Keep only observations at or before the query time
  functionLocalDF <- functionLocalDF[functionLocalDF$ts <= time, ]
  # fromLast = TRUE flags everything except the final row per process
  functionLocalDF[!duplicated(functionLocalDF$process, fromLast = TRUE), ]
}
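For the last-N case, a hedged base R sketch using split() and tail() (the helper name lastN is hypothetical):

# Returns the last n readings per process at or before `time` (sketch only)
lastN <- function(tsDF, time, n = 2) {
  df <- tsDF[tsDF$ts <= as.POSIXct(time), ]
  df <- df[order(df$process, df$ts), ]
  do.call(rbind, lapply(split(df, df$process), tail, n = n))
}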
Upvotes: 0
Reputation: 17725
Here's a cute solution using a rolling join in data.table:
library(data.table)

SomeHelpfulFunction <- function(tsDF, time) {
  # Observations, keyed by process and timestamp
  a = data.table(tsDF, key = c("process", "ts"))
  # One lookup row per process at the query time
  b = data.table(process = unique(tsDF$process),
                 ts = as.POSIXct(time),
                 key = c("process", "ts"))
  # roll = TRUE carries the last observation at or before each lookup time forward
  a[b, roll = TRUE]
}
SomeHelpfulFunction(tsDF, "2020-01-07 01:00")
#> ts observation process
#> 1: 2020-01-07 01:00:00 1.6 proc1
#> 2: 2020-01-07 01:00:00 1.8 proc2
#> 3: 2020-01-07 01:00:00 2.0 proc3
I found this blog post useful in explaining rolling joins: https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
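If the roll argument is new to you, a toy example (data invented purely for illustration) shows the difference it makes:

library(data.table)

# Keyed toy table; dt[.(7)] joins on the key column ts
dt <- data.table(ts = c(1, 5, 10), v = c("a", "b", "c"), key = "ts")
dt[.(7)]               # exact join: no row with ts == 7, so v is NA
dt[.(7), roll = TRUE]  # last observation rolled forward: v is "b"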
Upvotes: 1
Reputation: 4658
I think you want the last value before the time you are querying, for every process. With dplyr:
library(dplyr)

tsDF %>%
  group_by(process) %>%
  arrange(ts) %>%                       # oldest to newest within each process
  filter(ts <= "2020-01-07 01:00") %>%  # drop observations after the query time
  slice_tail() %>%                      # keep the most recent remaining row
  ungroup() %>%
  select(-ts)
which gives
# A tibble: 3 x 2
observation process
<dbl> <fct>
1 1.6 proc1
2 1.8 proc2
3 2 proc3
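If you are on an older dplyr release without slice_tail(), the question's update mentions using top_n() instead; a sketch of that variant (same cutoff time as above):

# top_n(1, ts) keeps the row with the largest ts in each group
tsDF %>%
  filter(ts <= "2020-01-07 01:00") %>%
  group_by(process) %>%
  top_n(1, ts) %>%
  ungroup() %>%
  select(-ts)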
Upvotes: 1