Ai4l2s
Ai4l2s

Reputation: 623

R: Aggregate data in sliding window into new columns

let's say I have a dataframe like this:

df <- tibble(ID = c(1, 1, 1, 1, 1), v1 = c(3, 5, 1, 0, 1), v2 = c(10, 6, 1, 20, 23), Time = c(as.POSIXct("1900-01-01 10:00:00"), as.POSIXct("1900-01-01 11:00:00"), as.POSIXct("1900-01-01 13:00:00"), as.POSIXct("1900-01-01 16:00:00"), as.POSIXct("1900-01-01 20:00:00"))) %>% group_by(ID)
# A tibble: 5 x 4
# Groups:   ID [1]
     ID    v1    v2 Time               
  <dbl> <dbl> <dbl> <dttm>             
1     1     3    10 1900-01-01 10:00:00
2     1     5     6 1900-01-01 11:00:00
3     1     1     1 1900-01-01 13:00:00
4     1     0    20 1900-01-01 16:00:00
5     1     1    23 1900-01-01 20:00:00

In words, this is a simple timeseries of a specific ID with two values v1 and v2 per time. As quite common in machine learning, I want to aggregate the last n timesteps into one feature vector. For all previous timesteps there should be a time reference in hours when this data point occured. For the first row, where no previous timestep is available, the data should be filled with zeros.

Let's make an example. In this case n=2, that is I want to aggregate the current time step (t2) and the prevopus (t1) together:

# A tibble: 5 x 6
     ID v1_t1 v2_t1 time_t1 v1_t2 v2_t2
  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
1     1     0     0      NA     3    10
2     1     3    10       1     5     6
3     1     5     6       2     1     1
4     1     1     1       3     0    20
5     1     0    20       4     1    23

I want to keep that as generic as possible, so that n can change and the number of data columns. Any idea how to do this? Thanks :)

Upvotes: 0

Views: 67

Answers (1)

stefan
stefan

Reputation: 125338

Using dplyr::lag and dplyr::across you could do:

library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)

df %>%
  group_by(ID) %>%
  mutate(time_t1 = lubridate::hour(Time) - lag(lubridate::hour(Time))) %>%
  mutate(across(c(v1, v2), .fns = list(t2 = ~.x, t1 = ~lag(.x, default = 0)))) %>% 
  select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups:   ID [1]
#>      ID time_t1 v1_t2 v1_t1 v2_t2 v2_t1
#>   <dbl>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1      NA     3     0    10     0
#> 2     1       1     5     3     6    10
#> 3     1       2     1     5     1     6
#> 4     1       3     0     1    20     1
#> 5     1       4     1     0    23    20

UPDATE Here is a more generic approach which makes use of some function factories to create list of functions which could then be passed to the .fns argument of across. Haven't tested for the more general case but should work for any n or number of lags to include and also for any number of data columns.

library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)

fun_factory1 <- function(n) {
  function(x) {
    lubridate::hour(x) - lag(lubridate::hour(x), n = n)
  }
}

fun_factory2 <- function(n) {
  function(x) {
    lag(x, n = n, default = 0)
  }
}

n <- 2

fns1 <- lapply(seq(n - 1), fun_factory1)
names(fns1) <- paste0("t", seq(n - 1))

fns2 <- lapply(seq(n) - 1, fun_factory2)
names(fns2) <- paste0("t", seq(n))

df %>%
  group_by(ID) %>%
  mutate(across(Time, .fns = fns1)) %>%
  mutate(across(c(v1, v2), .fns = fns2)) %>% 
  select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups:   ID [1]
#>      ID Time_t1 v1_t1 v1_t2 v2_t1 v2_t2
#>   <dbl>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1      NA     3     0    10     0
#> 2     1       1     5     3     6    10
#> 3     1       2     1     5     1     6
#> 4     1       3     0     1    20     1
#> 5     1       4     1     0    23    20

Upvotes: 1

Related Questions