TeYaP

Reputation: 323

R - Computations with lag variable by group

Using the following dataset:

set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 6)
dest <- rep(c("GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR", "DEU"), 6)
year <- rep(c(rep(1998, 10), rep(1999, 10), rep(2000, 10)), 2)
type <- rep(c(1,2,3,4,5), 12)
# type <- sample(1:10, size=length(origin), replace=TRUE)
a <- sample(100:10000, size=length(origin), replace=TRUE)
b <- sample(1000:100000, size=length(origin), replace=TRUE)
data.df <- as.data.frame(cbind(origin, dest, year, type, a,b))
rm(origin, year, dest, type, a,b)

I would like to compute, for instance, the following operation:

Here i is type, j is origin, and k is dest. I decided to first compute the lag of a, lag.a, with dplyr:

data.df <- data.df %>%
            group_by(origin, dest, type) %>%
            mutate(lag.a = lag(a, n = 1, default = NA))

I think this is correct, although I do not understand how R can know on its own which time reference to use.

By the way, doing so gives me what I need for the first part of my computation, (a_{t+1,ijk} - a_{t,ijk}). My problem is that I have no idea how to compute (lag.a_{t+1,ijk} * b_{t,ik}). Any ideas?

If possible, I would like a solution (dplyr or data.table) that does not store the lag variable in the dataset, so as not to make it heavier than necessary.

Upvotes: 0

Views: 382

Answers (1)

Ashwin Malshe

Reputation: 141

There are a couple of problems in your code. First, create your data.frame like this:

data.df <- data.frame(origin, dest, year, type, a, b)

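This will retain the class of all the vectors, whereas cbind() first builds a character matrix, so a, b, year and type stop being numeric. Note that in R versions before 4.0.0, data.frame() converts strings to factors by default; if you don't want origin and dest to be factors, pass the argument stringsAsFactors = FALSE to data.frame().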
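
To see the difference, here is a minimal sketch on a hypothetical three-row toy dataset (the names origin_toy, a_toy, via_cbind and via_df are made up for illustration and are not part of the question's data). It only uses base R:

```r
# Toy vectors (hypothetical, not the question's data)
origin_toy <- c("DEU", "GBR", "ITA")
a_toy <- c(10, 20, 30)

# cbind() coerces both vectors to a common type (character) before
# as.data.frame() ever sees them
via_cbind <- as.data.frame(cbind(origin_toy, a_toy), stringsAsFactors = FALSE)

# data.frame() keeps each vector's own class
via_df <- data.frame(origin_toy, a_toy, stringsAsFactors = FALSE)

class(via_cbind$a_toy)  # "character"
class(via_df$a_toy)     # "numeric"
```

With the character version, an expression such as a - lag(a) fails because subtraction is not defined for strings.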

Next, create your new variable as follows:

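library(dplyr)

data.df2 <- data.df %>%
  group_by(origin, dest, type) %>%
  arrange(year) %>%                       # sort so that lag() sees the earlier year first
  mutate(new_var = (a - lag(a)) * b) %>%  # lag(a) is used inline, not stored as a column
  ungroup()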

Here, new_var is the variable that you want. You are right that dplyr doesn't know which row belongs to the previous time period: lag() simply takes the value from the previous row within each group. Therefore, you have to sort the rows with arrange(year) first.
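
As a small illustration of why row order matters, the sketch below uses a hypothetical toy data frame (toy, with made-up values and a single group grp) whose rows are deliberately out of year order. It compares a plain lag(a) with lag(a, order_by = year); order_by is an argument of dplyr's lag() that orders the values by year before shifting. It is shown here as a possible alternative to arrange(year), not as what the code above does.

```r
library(dplyr)

# Hypothetical toy data: one group, rows deliberately not sorted by year
toy <- data.frame(
  grp  = c("x", "x", "x"),
  year = c(2000, 1998, 1999),
  a    = c(30, 10, 20),
  b    = c(3, 1, 2)
)

res <- toy %>%
  group_by(grp) %>%
  mutate(
    diff_row_order  = a - lag(a),                   # previous row, whatever its year
    diff_year_order = a - lag(a, order_by = year),  # previous year within the group
    new_var         = diff_year_order * b
  ) %>%
  ungroup()

res
```

In this toy case diff_row_order compares 1998 against 2000 simply because 2000 happens to be the row above, while diff_year_order compares each year with the one before it. Sorting with arrange(year) and then calling lag(a) gives the same new_var.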

Upvotes: 1
