Reputation: 323
Using the following dataset:
set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 6)
dest <- rep(c("GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR", "DEU"), 6)
year <- rep(c(rep(1998, 10), rep(1999, 10), rep(2000, 10)), 2)
type <- rep(c(1,2,3,4,5), 12)
# type <- sample(1:10, size=length(origin), replace=TRUE)
a <- sample(100:10000, size=length(origin), replace=TRUE)
b <- sample(1000:100000, size=length(origin), replace=TRUE)
data.df <- as.data.frame(cbind(origin, dest, year, type, a,b))
rm(origin, year, dest, type, a,b)
I would like to compute, for instance, the following operation:
with i being type, j origin, and k dest. I decided to first compute the lag of a, lag.a, with dplyr:
library(dplyr)
data.df <- data.df %>%
  group_by(origin, dest, type) %>%
  mutate(lag.a = lag(a, n = 1, default = NA))
I think this is correct, although I do not understand how R can work out on its own which time reference to use.
Doing so, I obtained a result corresponding to the first part (a t+1 ijk - a t ijk) of my computation. My problem is that I have no idea how to do (lag.a t+1 ijk * b t ik). Any ideas?
If possible, I would like a solution (dplyr or data.table) that does not mutate the lag variable into the dataset, so as not to weigh it down more than necessary.
Upvotes: 0
Views: 382
Reputation: 141
There are a couple of problems in your code. First, create your data.frame like this:
data.df <- data.frame(origin, dest, year, type, a, b)
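This retains the class of each vector, whereas cbind() first builds a matrix and so coerces every column to a single type. Note that if you don't want origin and dest to be factors, just use the argument stringsAsFactors = FALSE in the data.frame() function (from R 4.0.0 onward this is already the default).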
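To see the difference between the two ways of building the data frame, here is a minimal sketch. The toy vectors are made up for illustration and are not the question's data.

```r
# Hypothetical toy vectors, only to illustrate the coercion
origin <- c("DEU", "GBR")
year   <- c(1998, 1999)
a      <- c(150L, 320L)

# cbind() builds a matrix first; a matrix holds a single type,
# so the numeric columns are coerced along with the character one
via_cbind <- as.data.frame(cbind(origin, year, a))
sapply(via_cbind, class)

# data.frame() keeps each column's own class
direct <- data.frame(origin, year, a, stringsAsFactors = FALSE)
sapply(direct, class)
```

With the cbind() version, a and year are no longer numeric, so arithmetic such as a - lag(a) would fail or need a conversion first.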
Next, create your new variable as follows:
data.df2 <- data.df %>%
  group_by(origin, dest, type) %>%
  arrange(year) %>%
  mutate(new_var = (a - lag(a)) * b) %>%
  ungroup()
Here, new_var is the variable that you want. You are right that dplyr doesn't know that the lagged value is from the previous time period: lag() simply takes the previous row within each group. Therefore, you have to sort the rows with arrange(year) first.
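As a small check of why the row order matters, here is a sketch on a made-up one-group panel whose years are deliberately out of order. The toy data frame and its values are assumptions for illustration. It also shows the order_by argument of dplyr's lag(), which may be an alternative if you would rather not reorder the rows.

```r
library(dplyr)

# Hypothetical toy panel: a single origin/dest/type group, years unsorted
toy <- data.frame(
  origin = "DEU", dest = "GBR", type = 1,
  year = c(2000, 1998, 1999),
  a = c(30, 10, 20),
  b = c(3, 1, 2)
)

# Sort by year, then lag within each group
res <- toy %>%
  group_by(origin, dest, type) %>%
  arrange(year) %>%
  mutate(new_var = (a - lag(a)) * b) %>%
  ungroup()
res$new_var
# NA 20 30  (rows are now in year order: 1998, 1999, 2000)

# Alternative: keep the original row order and tell lag() what to order by
res2 <- toy %>%
  group_by(origin, dest, type) %>%
  mutate(new_var = (a - lag(a, order_by = year)) * b) %>%
  ungroup()
res2$new_var
# 30 NA 20  (rows are still in the original order: 2000, 1998, 1999)
```

In both versions the lag is computed inside mutate() without being stored as a separate column, which is what the question asked for.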
Upvotes: 1