Whitebeard
Whitebeard

Reputation: 6193

Create multiple lead variables in data.table

This question is similar to Creating a bunch of lagged variables in data.table at once and How to create a lag variable within each group?, but not quite the same as far as I can tell.

I would like to create several lead variables, e.g. lead1, lead2, and lead3 below, grouped by groups.

Example data

require(data.table)
set.seed(1)
data <- data.table(time =c(1:10,1:8),groups = c(rep(c("a","b"),c(10,8))), 
   value = rnorm(18))
data
    time groups       value
 1:    1      a -0.62645381
 2:    2      a  0.18364332
 3:    3      a -0.83562861
 4:    4      a  1.59528080
 5:    5      a  0.32950777
 6:    6      a -0.82046838
 7:    7      a  0.48742905
 8:    8      a  0.73832471
 9:    9      a  0.57578135
10:   10      a -0.30538839
11:    1      b  1.51178117
12:    2      b  0.38984324
13:    3      b -0.62124058
14:    4      b -2.21469989
15:    5      b  1.12493092
16:    6      b -0.04493361
17:    7      b -0.01619026
18:    8      b  0.94383621

The resulting data table should be

   time groups       value       lead1       lead2       lead3
1     1      a -0.62645381  0.18364332 -0.83562861  1.59528080
2     2      a  0.18364332 -0.83562861  1.59528080  0.32950777
3     3      a -0.83562861  1.59528080  0.32950777 -0.82046838
4     4      a  1.59528080  0.32950777 -0.82046838  0.48742905
5     5      a  0.32950777 -0.82046838  0.48742905  0.73832471
6     6      a -0.82046838  0.48742905  0.73832471  0.57578135
7     7      a  0.48742905  0.73832471  0.57578135 -0.30538839
8     8      a  0.73832471  0.57578135 -0.30538839          NA
9     9      a  0.57578135 -0.30538839          NA          NA
10   10      a -0.30538839          NA          NA          NA
11    1      b  1.51178117  0.38984324 -0.62124058 -2.21469989
12    2      b  0.38984324 -0.62124058 -2.21469989  1.12493092 
13    3      b -0.62124058 -2.21469989  1.12493092 -0.04493361
14    4      b -2.21469989  1.12493092 -0.04493361 -0.01619026
15    5      b  1.12493092 -0.04493361 -0.01619026  0.94383621
16    6      b -0.04493361 -0.01619026  0.94383621          NA
17    7      b -0.01619026  0.94383621          NA          NA
18    8      b  0.94383621          NA          NA          NA

Note that my actual data set is much larger and I may need more than 3 lead variables.

I am using data.table version 1.9.4 and am not sure when I will be able to update to the latest version, so a solution in this version would be a bonus. Sorry for this additional constraint.

Thanks in advance.

Upvotes: 1

Views: 1402

Answers (3)

eipi10
eipi10

Reputation: 93791

The function below will create lead columns for all lead values specified in the leads argument, which is just a vector of positive integers.

library(data.table)

lead.n = function(leads, values) {
  as.data.frame(sapply(leads, function(n) {
    lead.vals = c(values[-c(1:n)], rep(NA,n))
  }
  ))
}

data[, paste0("lead",1:3):=lead.n(1:3,value), by=groups]

    time groups       value       lead1       lead2       lead3
 1:    1      a -0.62645381  0.18364332 -0.83562861  1.59528080
 2:    2      a  0.18364332 -0.83562861  1.59528080  0.32950777
 3:    3      a -0.83562861  1.59528080  0.32950777 -0.82046838
 4:    4      a  1.59528080  0.32950777 -0.82046838  0.48742905
 5:    5      a  0.32950777 -0.82046838  0.48742905  0.73832471
 6:    6      a -0.82046838  0.48742905  0.73832471  0.57578135
 7:    7      a  0.48742905  0.73832471  0.57578135 -0.30538839
 8:    8      a  0.73832471  0.57578135 -0.30538839          NA
 9:    9      a  0.57578135 -0.30538839          NA          NA
10:   10      a -0.30538839          NA          NA          NA
11:    1      b  1.51178117  0.38984324 -0.62124058 -2.21469989
12:    2      b  0.38984324 -0.62124058 -2.21469989  1.12493092
13:    3      b -0.62124058 -2.21469989  1.12493092 -0.04493361
14:    4      b -2.21469989  1.12493092 -0.04493361 -0.01619026
15:    5      b  1.12493092 -0.04493361 -0.01619026  0.94383621
16:    6      b -0.04493361 -0.01619026  0.94383621          NA
17:    7      b -0.01619026  0.94383621          NA          NA
18:    8      b  0.94383621          NA          NA          NA

Upvotes: 4

David Arenburg
David Arenburg

Reputation: 92282

The standard data.table way would be using the built in shift function (as was already mentioned in the linked thread). you will need the latest stable version on CRAN for this - v 1.9.6+

library(data.table) # V1.9.6+
data[, paste0("lead", 1L:3L) := shift(value, 1L:3L, type = "lead"), by = groups]
data
#     time groups       value       lead1       lead2       lead3
#  1:    1      a -0.62645381  0.18364332 -0.83562861  1.59528080
#  2:    2      a  0.18364332 -0.83562861  1.59528080  0.32950777
#  3:    3      a -0.83562861  1.59528080  0.32950777 -0.82046838
#  4:    4      a  1.59528080  0.32950777 -0.82046838  0.48742905
#  5:    5      a  0.32950777 -0.82046838  0.48742905  0.73832471
#  6:    6      a -0.82046838  0.48742905  0.73832471  0.57578135
#  7:    7      a  0.48742905  0.73832471  0.57578135 -0.30538839
#  8:    8      a  0.73832471  0.57578135 -0.30538839          NA
#  9:    9      a  0.57578135 -0.30538839          NA          NA
# 10:   10      a -0.30538839          NA          NA          NA
# 11:    1      b  1.51178117  0.38984324 -0.62124058 -2.21469989
# 12:    2      b  0.38984324 -0.62124058 -2.21469989  1.12493092
# 13:    3      b -0.62124058 -2.21469989  1.12493092 -0.04493361
# 14:    4      b -2.21469989  1.12493092 -0.04493361 -0.01619026
# 15:    5      b  1.12493092 -0.04493361 -0.01619026  0.94383621
# 16:    6      b -0.04493361 -0.01619026  0.94383621          NA
# 17:    7      b -0.01619026  0.94383621          NA          NA
# 18:    8      b  0.94383621          NA          NA          NA

Upvotes: 8

LyzandeR
LyzandeR

Reputation: 37879

You can use function lead from dplyr and do it in one call with data.table like this:

library(data.table)
library(dplyr)

data[, c('lead1','lead2','lead3') := list(lead(value, 1), lead(value, 2), lead(value, 3))
     , by=groups]

Output

> data
    time groups       value       lead1       lead2       lead3
 1:    1      a -0.62645381  0.18364332 -0.83562861  1.59528080
 2:    2      a  0.18364332 -0.83562861  1.59528080  0.32950777
 3:    3      a -0.83562861  1.59528080  0.32950777 -0.82046838
 4:    4      a  1.59528080  0.32950777 -0.82046838  0.48742905
 5:    5      a  0.32950777 -0.82046838  0.48742905  0.73832471
 6:    6      a -0.82046838  0.48742905  0.73832471  0.57578135
 7:    7      a  0.48742905  0.73832471  0.57578135 -0.30538839
 8:    8      a  0.73832471  0.57578135 -0.30538839          NA
 9:    9      a  0.57578135 -0.30538839          NA          NA
10:   10      a -0.30538839          NA          NA          NA
11:    1      b  1.51178117  0.38984324 -0.62124058 -2.21469989
12:    2      b  0.38984324 -0.62124058 -2.21469989  1.12493092
13:    3      b -0.62124058 -2.21469989  1.12493092 -0.04493361
14:    4      b -2.21469989  1.12493092 -0.04493361 -0.01619026
15:    5      b  1.12493092 -0.04493361 -0.01619026  0.94383621
16:    6      b -0.04493361 -0.01619026  0.94383621          NA
17:    7      b -0.01619026  0.94383621          NA          NA
18:    8      b  0.94383621          NA          NA          NA

Upvotes: 2

Related Questions