Reputation: 2677
I'd like to create a variable containing the value of a variable in the previous year within a group.
id date value
1 1 1992 4.1
2 1 NA 4.5
3 1 1991 3.3
4 1 1990 5.3
5 1 1994 3.0
6 2 1992 3.2
7 2 1991 5.2
value_lagged
should be missing when the previous year is missing within a group - either because it is the first date within a group (as in row 4, 7), or because there are year gaps in the data (as in row 5). Also, value_lagged
should be missing when the current time is missing (as in row 2).
This gives:
id date value value_lagged
1 1 1992 4.1 3.3
2 1 NA 4.5 NA
3 1 1991 3.3 5.3
4 1 1990 5.3 NA
5 1 1994 3.0 NA
6 2 1992 3.2 5.2
7 2 1991 5.2 NA
For now, in R, I use the data.table
package
DT = data.table(id = c(1,1,1,1,1,2,2),
date = c(1992,NA,1991,1990,1994,1992,1991),
value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2)
)
setkey(DT, id, date)
DT[, value_lagged := DT[J(id, date-1), value], ]
DT[is.na(date), value_lagged := NA, ]
It's fast but it seems somewhat error prone to me. I'd like to know if there are better alternatives using data.table
, dplyr
, or any other package. Thanks a lot!
In Stata
, one would do:
tsset id date
gen value_lagged=L.value
Upvotes: 14
Views: 7654
Reputation: 2677
Create a function tlag
, which lags a vector given a vector of times, and use it within groups defined by id
library(dplyr)
tlag <- function(x, n = 1L, time) {
index <- match(time - n, time, incomparables = NA)
x[index]
}
df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
Upvotes: 7
Reputation: 1369
An elegant and fast general solution to computations on irregular time series and unbalanced panels is now offered by the collapse
package with the functions flag
, fdiff
and fgrowth
. See a general answer for lagging an unbalanced panel here.
Now in your specific application there is an additional rare fact that the panel is not only unbalanced, but you have a missing value in your time variable which means you don't know the time period when a record was observed. In such cases just applying collapse::flag
won't do, but you could generate a new id variable that places the missing value in a separate group with collapse::seqid
. So my solution would be:
library(collapse)
DF = data.frame(id = c(1,1,1,1,1,2,2),
date = c(1992,NA,1991,1990,1994,1992,1991),
value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2))
settransform(DF, l_value = flag(value, 1, g = seqid(date, order(id, date)), t = date))
DF
#> id date value l_value
#> 1 1 1992 4.1 3.3
#> 2 1 NA 4.5 NA
#> 3 1 1991 3.3 5.3
#> 4 1 1990 5.3 NA
#> 5 1 1994 3.0 NA
#> 6 2 1992 3.2 5.2
#> 7 2 1991 5.2 NA
Created on 2021-07-10 by the reprex package (v0.3.0)
I am quite confident that this is still faster than data.table
, but I have not tested this. Again this data is rather odd, for most cases where the panel is unbalanced but the records are all identified by id and time, a simple flag(value, 1, id, as.integer(date))
will do fine and be striking fast. Note that you get efficiency gains by ensuring that the time variable is integer, as flag
will coerce non-integer time variables to factor, which could also get rid of the irregularity.
Upvotes: 0
Reputation: 118789
Using 1.9.5
, where joins don't need keys to be set, this can be done as follows:
require(data.table) # v1.9.5+
DT[!is.na(date), value_lagged :=
.SD[.(id = id, date = date - 1), value, on = c("id", "date")]]
# id date value value_lagged
# 1: 1 1992 4.1 3.3
# 2: 1 NA 4.5 NA
# 3: 1 1991 3.3 5.3
# 4: 1 1990 5.3 NA
# 5: 1 1994 3.0 NA
# 6: 2 1992 3.2 5.2
# 7: 2 1991 5.2 NA
It's a variation of your idea. The trick is to use is.na()
directly in i
and use .SD
in j
instead of DT
. I've used on=
syntax, but the same idea can of course be done by setting keys as well. .
Upvotes: 8
Reputation: 103898
I'd probably tackle this using a join:
library(dplyr)
df <- data.frame(
id = c(1, 1, 1, 1, 1, 2, 2),
date = c(1992, NA, 1991, 1990, 1994, 1992, 1991),
value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2)
)
last_year <- df %>%
filter(!is.na(date)) %>%
mutate(date = date + 1, lagged_value = value, value = NULL)
df %>%
left_join(last_year)
#> Joining by: c("id", "date")
#> id date value lagged_value
#> 1 1 1992 4.1 3.3
#> 2 1 NA 4.5 NA
#> 3 1 1991 3.3 5.3
#> 4 1 1990 5.3 NA
#> 5 1 1994 3.0 NA
#> 6 2 1992 3.2 5.2
#> 7 2 1991 5.2 NA
Upvotes: 11