Ric S

Reputation: 9277

Create a lag/lead variable in dplyr with a dynamic number of positions

I'm looking for a way to create a column from a lagged column where the number of positions (the argument n of lag) is dynamic: for each row, n should be the value stored in another column (see the lag documentation).

Sample data:

library(dplyr)

set.seed(42)
df <- as_tibble(data.frame(
  id = c(rep(1,6), rep(2,5), rep(3,6)),
  n_steps = c(0,0,0,0,1,2,0,0,1:3,0,1:5),
  var1 = sample(1:9, 17, replace = TRUE),
  var2 = runif(17, 1, 2)))

# A tibble: 17 x 4
      id n_steps  var1  var2
   <dbl>   <dbl> <int> <dbl>
 1     1       0     1  1.08
 2     1       0     5  1.51
 3     1       0     1  1.39
 4     1       0     9  1.91
 5     1       1     4  1.45
 6     1       2     2  1.84
 7     2       0     1  1.74
 8     2       0     8  1.81
 9     2       1     7  1.39
10     2       2     4  1.69
11     2       3     9  1.00
12     3       0     5  1.83
13     3       1     4  1.01
14     3       2     2  1.21
15     3       3     3  1.91
16     3       4     9  1.61
17     3       5     9  1.38

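More practically speaking, I want to create a variable var3 under the following conditions: when n_steps is 0, var3 equals var1; otherwise, var3 equals var1 plus the value of var2 lagged by n_steps positions within the same id.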

I tried the following code:

df %>% 
  group_by(id) %>% 
  mutate(var3 = ifelse(n_steps == 0, var1, var1 + lag(var2, n = n_steps)))

which returns the following error:

Error: Problem with mutate() input var3.
n must be a nonnegative integer scalar, not a double vector of length 6.
Input var3 is ifelse(n_steps == 0, var1, var1 + lag(var2, n = n_steps)).
The error occured in group 1: id = 1.
Run rlang::last_error() to see where the error occurred.

I understand the error: the argument n must be a single value. I thought of referring to the current row, but I couldn't find anything about that on SO. I could also create one lag column for each distinct value of n_steps, but I'm looking for a way to apply these lags dynamically without actually creating the columns. How can I achieve this?

Expected output:

      id n_steps  var1  var2  var3
 1     1       0     1  1.08  1   
 2     1       0     5  1.51  5   
 3     1       0     1  1.39  1   
 4     1       0     9  1.91  9   
 5     1       1     4  1.45  5.91
 6     1       2     2  1.84  3.91
 7     2       0     1  1.74  1   
 8     2       0     8  1.81  8   
 9     2       1     7  1.39  8.81
10     2       2     4  1.69  5.81
11     2       3     9  1.00 10.81 
12     3       0     5  1.83  5   
13     3       1     4  1.01  5.83
14     3       2     2  1.21  3.83
15     3       3     3  1.91  4.83
16     3       4     9  1.61 10.83 
17     3       5     9  1.38 10.83

Upvotes: 2

Views: 1776

Answers (2)

Georgery

Reputation: 8127

Here's a solution with a for-loop:

df <- df %>%
    mutate(var3 = NA)

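# The loop indexes the whole data frame, so it assumes n_steps never reaches
# back past the first row of an id group (true for the sample data).
for (i in seq_len(nrow(df))) {
    n <- df$n_steps[i]
    # with no lag, var3 is just var1; otherwise add var2 from n rows earlier
    df$var3[i] <- if (n == 0) df$var1[i] else df$var1[i] + df$var2[i - n]
}
df

# A tibble: 17 x 5
      id n_steps  var1  var2  var3
   <dbl>   <dbl> <int> <dbl> <dbl>
 1     1       0     1  1.08  1   
 2     1       0     5  1.51  5   
 3     1       0     1  1.39  1   
 4     1       0     9  1.91  9   
 5     1       1     4  1.45  5.91
 6     1       2     2  1.84  3.91
 7     2       0     1  1.74  1   
 8     2       0     8  1.81  8   
 9     2       1     7  1.39  8.81
10     2       2     4  1.69  5.81
11     2       3     9  1.00 10.8 
12     3       0     5  1.83  5   
13     3       1     4  1.01  5.83
14     3       2     2  1.21  3.83
15     3       3     3  1.91  4.83
16     3       4     9  1.61 10.8 
17     3       5     9  1.38 10.8 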

lag does not work here because it shifts the whole vector by a single offset: n must be a scalar, so it cannot apply a different lag to each row. The loop gets around this by looking up the offset row by row.
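
That said, a different offset for each element can be written with plain indexing instead of lag. The following is a minimal base R sketch on a made-up vector (x and n are illustrative values, not the question's data): subtracting the offsets from the positions gives an index vector, and positions that would fall before the first element are set to NA.

```r
# Toy vector and per-element offsets (hypothetical values)
x <- c(10, 20, 30, 40, 50)
n <- c(0, 1, 0, 2, 3)

# A single offset shifts the whole vector; this mirrors a lag by one position
lag_by_one <- c(NA, x[-length(x)])

# A per-element offset expressed as an index vector
idx <- seq_along(x) - n
idx[idx < 1] <- NA   # offsets reaching before the first element become NA
lagged <- x[idx]

print(lag_by_one)
print(lagged)
```

Indexing with NA returns NA, which is why the out-of-range positions are marked that way rather than left as zero or negative numbers.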

Upvotes: 0

BellmanEqn

Reputation: 799

You can index var2 directly, using the row number minus n_steps as the position:

df %>% 
  mutate(var3 = ifelse(n_steps == 0, var1, var1 + var2[row_number()-n_steps]))
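
A note on why this works and where it may need care: var2[row_number() - n_steps] picks, for every row, the element of var2 that sits n_steps positions earlier, so no lag column is ever created. Without group_by(id) the positions are counted over the whole data frame, so an offset larger than the row's position within its id would read from the previous group, and R does not treat an index of zero or below as missing. The sketch below is one possible grouped variant, assuming such rows should yield NA; the toy data and the helper column src are hypothetical.

```r
library(dplyr)

# Small hypothetical data set: two groups, per-row lag offsets in n_steps
toy <- tibble(
  id      = c(1, 1, 1, 2, 2, 2),
  n_steps = c(0, 1, 2, 0, 1, 5),   # the last offset reaches before its group's start
  var1    = c(1, 2, 3, 4, 5, 6),
  var2    = c(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
)

result <- toy %>%
  group_by(id) %>%
  mutate(
    # position, within the group, of the row to read var2 from
    src  = row_number() - n_steps,
    # positions before the group's first row are treated as missing
    src  = if_else(src >= 1, src, NA_real_),
    # no lag: keep var1; otherwise add the var2 value found at position src
    var3 = if_else(n_steps == 0, var1, var1 + var2[src])
  ) %>%
  ungroup() %>%
  select(-src)

print(result)
```

On the question's sample data no offset reaches past the start of its group, so this variant should give the same var3 as the one-liner above; the guard only matters for data where that can happen.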

Upvotes: 5
