Matthew
Matthew

Reputation: 2677

Create lagged variable in unbalanced panel data in R

I'd like to create a variable containing the value of a variable in the previous year within a group.

     id   date        value
1     1   1992          4.1  
2     1     NA          4.5  
3     1   1991          3.3  
4     1   1990          5.3  
5     1   1994          3.0  
6     2   1992          3.2  
7     2   1991          5.2  

value_lagged should be missing when the previous year is missing within a group - either because it is the first date within a group (as in row 4, 7), or because there are year gaps in the data (as in row 5). Also, value_lagged should be missing when the current time is missing (as in row 2).

This gives:

     id   date    value    value_lagged  
1     1   1992      4.1             3.3
2     1     NA      4.5              NA
3     1   1991      3.3             5.3
4     1   1990      5.3              NA
5     1   1994      3.0              NA
6     2   1992      3.2             5.2
7     2   1991      5.2              NA

For now, in R, I use the data.table package

 DT = data.table(id    = c(1,1,1,1,1,2,2),
                 date  = c(1992,NA,1991,1990,1994,1992,1991),
                 value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2)
                )
 setkey(DT, id, date)
 DT[, value_lagged := DT[J(id, date-1), value], ]
 DT[is.na(date), value_lagged := NA, ]

It's fast but it seems somewhat error prone to me. I'd like to know if there are better alternatives using data.table, dplyr, or any other package. Thanks a lot!


In Stata, one would do:

    tsset id date
    gen value_lagged=L.value

Upvotes: 14

Views: 7654

Answers (4)

Matthew
Matthew

Reputation: 2677

Create a function tlag, which lags a vector given a vector of times, and use it within groups defined by id

library(dplyr)
tlag <- function(x, n = 1L, time) { 
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))

Upvotes: 7

Sebastian
Sebastian

Reputation: 1369

An elegant and fast general solution to computations on irregular time series and unbalanced panels is now offered by the collapse package with the functions flag, fdiff and fgrowth. See a general answer for lagging an unbalanced panel here.

Now in your specific application there is an additional rare fact that the panel is not only unbalanced, but you have a missing value in your time variable which means you don't know the time period when a record was observed. In such cases just applying collapse::flag won't do, but you could generate a new id variable that places the missing value in a separate group with collapse::seqid. So my solution would be:

library(collapse)  
DF = data.frame(id    = c(1,1,1,1,1,2,2),
                date  = c(1992,NA,1991,1990,1994,1992,1991),
                value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2))

settransform(DF, l_value = flag(value, 1, g = seqid(date, order(id, date)), t = date))
DF
#>   id date value l_value
#> 1  1 1992   4.1     3.3
#> 2  1   NA   4.5      NA
#> 3  1 1991   3.3     5.3
#> 4  1 1990   5.3      NA
#> 5  1 1994   3.0      NA
#> 6  2 1992   3.2     5.2
#> 7  2 1991   5.2      NA

Created on 2021-07-10 by the reprex package (v0.3.0)

I am quite confident that this is still faster than data.table, but I have not tested this. Again this data is rather odd, for most cases where the panel is unbalanced but the records are all identified by id and time, a simple flag(value, 1, id, as.integer(date)) will do fine and be striking fast. Note that you get efficiency gains by ensuring that the time variable is integer, as flag will coerce non-integer time variables to factor, which could also get rid of the irregularity.

Upvotes: 0

Arun
Arun

Reputation: 118789

Using 1.9.5, where joins don't need keys to be set, this can be done as follows:

require(data.table) # v1.9.5+
DT[!is.na(date), value_lagged := 
         .SD[.(id = id, date = date - 1), value, on = c("id", "date")]]
#    id date value value_lagged
# 1:  1 1992   4.1          3.3
# 2:  1   NA   4.5           NA
# 3:  1 1991   3.3          5.3
# 4:  1 1990   5.3           NA
# 5:  1 1994   3.0           NA
# 6:  2 1992   3.2          5.2
# 7:  2 1991   5.2           NA

It's a variation of your idea. The trick is to use is.na() directly in i and use .SD in j instead of DT. I've used on= syntax, but the same idea can of course be done by setting keys as well. .

Upvotes: 8

hadley
hadley

Reputation: 103898

I'd probably tackle this using a join:

library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2), 
  date = c(1992, NA, 1991, 1990, 1994, 1992, 1991), 
  value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2)
)


last_year <- df %>% 
  filter(!is.na(date)) %>%
  mutate(date = date + 1, lagged_value = value, value = NULL)

df %>%
  left_join(last_year)
#> Joining by: c("id", "date")
#>   id date value lagged_value
#> 1  1 1992   4.1          3.3
#> 2  1   NA   4.5           NA
#> 3  1 1991   3.3          5.3
#> 4  1 1990   5.3           NA
#> 5  1 1994   3.0           NA
#> 6  2 1992   3.2          5.2
#> 7  2 1991   5.2           NA

Upvotes: 11

Related Questions