Reputation: 1800
I am trying to find an easier way to calculate change in a variable (represented by a column) in a data frame using dplyr
. My toy data set is something like this
structure(list(CAR = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L,
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("a",
"b", "c", "d", "e", "f"), class = "factor"), TIME = c(0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L
), VAR = c(20L, 30L, 40L, 50L, 60L, 70L, 30L, 40L, 50L, 60L,
70L, 80L, 40L, 50L, 60L, 70L, 80L, 90L)), .Names = c("CAR", "TIME",
"VAR"), class = "data.frame", row.names = c(NA, -18L))
which looks like
CAR TIME VAR
1 a 0 20
2 b 0 30
3 c 0 40
4 d 0 50
5 e 0 60
6 f 0 70
7 a 1 30
8 b 1 40
9 c 1 50
10 d 1 60
11 e 1 70
12 f 1 80
13 a 2 40
14 b 2 50
15 c 2 60
16 d 2 70
17 e 2 80
18 f 2 90
I am trying to calculate change in VAR
between TIME
equal to 0
and other times, e.g., 1,2
for each CAR
.
This is what I do, which seems a very convoluted way, first I get values of VAR
at TIME
equals to 0
library(dplyr)
X <- local_test %>% filter(TIME == 0) %>% group_by(CAR) %>% mutate(baseline_VAR = VAR)
X
looks like
Source: local data frame [6 x 4]
Groups: CAR
CAR TIME VAR baseline_VAR
1 a 0 20 20
2 b 0 30 30
3 c 0 40 40
4 d 0 50 50
5 e 0 60 60
6 f 0 70 70
then, I do a left_join
with the original data frame local_test
Y <- left_join(local_test, X, by = c("CAR"))
Y
looks like
CAR TIME.x VAR.x TIME.y VAR.y baseline_VAR
1 a 0 20 0 20 20
2 b 0 30 0 30 30
3 c 0 40 0 40 40
4 d 0 50 0 50 50
5 e 0 60 0 60 60
6 f 0 70 0 70 70
7 a 1 30 0 20 20
8 b 1 40 0 30 30
9 c 1 50 0 40 40
10 d 1 60 0 50 50
11 e 1 70 0 60 60
12 f 1 80 0 70 70
13 a 2 40 0 20 20
14 b 2 50 0 30 30
15 c 2 60 0 40 40
16 d 2 70 0 50 50
17 e 2 80 0 60 60
18 f 2 90 0 70 70
finally, I add a column in Y
, which calculates the change in VAR
between two different TIME
for CAR
Y %>% group_by(CAR) %>% mutate(change_VAR = VAR.x - baseline_VAR)
Final Y
looks like
Source: local data frame [18 x 7]
Groups: CAR
CAR TIME.x VAR.x TIME.y VAR.y baseline_VAR change_VAR
1 a 0 20 0 20 20 0
2 b 0 30 0 30 30 0
3 c 0 40 0 40 40 0
4 d 0 50 0 50 50 0
5 e 0 60 0 60 60 0
6 f 0 70 0 70 70 0
7 a 1 30 0 20 20 10
8 b 1 40 0 30 30 10
9 c 1 50 0 40 40 10
10 d 1 60 0 50 50 10
11 e 1 70 0 60 60 10
12 f 1 80 0 70 70 10
13 a 2 40 0 20 20 20
14 b 2 50 0 30 30 20
15 c 2 60 0 40 40 20
16 d 2 70 0 50 50 20
17 e 2 80 0 60 60 20
18 f 2 90 0 70 70 20
This seems like a lot of extra work, with extra columns being added to the original data frame. I need to do this operation for a large data frame repeatedly. Is there an easier (one-step) way to compute change_VAR
?
Thanks!
Upvotes: 4
Views: 2442
Reputation: 17279
group_by(DF, CAR) %>%
arrange(CAR, TIME) %>%
mutate(change_VAR = VAR-VAR[1]) %>%
ungroup() %>%
arrange(TIME, CAR)
I did a little more arranging that akrun, but I'm not assuming the data are properly ordered up front. Arranging also ensures that the earliest observation is at position 1 in each group.
EDIT: If you're interested in this kind of thing. Here's the speed comparison between mine and Akrun's answers. (short answer, go with Akrun)
microbenchmark(
akrun1 = DF %>%
group_by(CAR) %>%
mutate(change_VAR= VAR- min(VAR)),
akrun2 = DF %>%
group_by(CAR) %>%
mutate(change_VAR= VAR- VAR[TIME==0]),
Benjamin = group_by(DF, CAR) %>%
arrange(CAR, TIME) %>%
mutate(change_VAR = VAR-VAR[1]) %>%
ungroup() %>%
arrange(TIME, CAR))
unit: microseconds
expr min lq mean median uq max neval cld
akrun1 887.360 903.1950 975.6354 930.467 954.5125 3246.223 100 a
akrun2 888.240 901.1425 947.5220 913.899 952.7530 3002.536 100 a
Benjamin 1584.697 1604.7835 1720.8034 1662.993 1692.1705 3765.561 100 b
Upvotes: 3
Reputation: 887691
This could be done by taking the difference of 'VAR' with the min
of 'VAR' grouped by 'CAR'.
local_test %>%
group_by(CAR) %>%
mutate(change_VAR= VAR- min(VAR))
Or if the base value of 'VAR' is when 'TIME' is 0 (assuming there are no duplicates of 'TIME' per each group), we subset the 'VAR' for TIME 0 and get the difference.
local_test %>%
group_by(CAR) %>%
mutate(change_VAR= VAR- VAR[TIME==0])
Upvotes: 6