user6550364

R: Dplyr Lagging Variables after Grouping by Multiple Columns

I want to calculate the score difference after grouping by Year, State, Tier, Group. A stylised representation of my data would look like:

dat2 <- data.frame(
  Year = sample(1990:1996, 10, replace = TRUE),
  State = sample(c("AL", "CA", "NY"), 10, replace = TRUE),
  Tier = sample(1:2, 10, replace = TRUE),
  Group = sample(c("A", "B"), 10, replace = TRUE),
  Score = rnorm(10))

I tried mutate with group_by_ and .dots, but it takes the value from the adjacent row of the whole data frame rather than from the adjacent row within each group (i.e. the grouping does not seem to work). I am mostly interested in plotting the yearly differences (à la time series, even though some years would be NA), so this can be solved either by lagging or by looking up the next year's score.

Edit: So, if the dataset looks like:

Year    State    Tier    Group    Score
1990    AL       1       A        75
1990    AL       2       A        100
1990    AL       1       B        5
1990    AL       2       B        10
1991    AL       1       A        95
1991    AL       2       A        80
1991    AL       1       B        5
1991    AL       2       B        15

The desired end result would be:

Year    State    Tier    Group    Score   Diff
1991    AL       1       A        95      20     
1991    AL       1       B        5       0  
1991    AL       2       A        80      -20
1991    AL       2       B        15      5

Upvotes: 2

Views: 4468

Answers (1)

Constantinos

Reputation: 1327

If I understand correctly, you want the year-over-year change in Score within each combination of State, Tier, and Group. The data must be sorted chronologically for the difference to make sense. Your example is too small for these combinations to repeat often, but I believe the solution you are looking for is:

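library(dplyr)
dat2 %>%
  arrange(Year) %>%                          # sort chronologically first
  group_by(State, Tier, Group) %>%           # lag() then only looks inside each group
  mutate(ScoreDiff = Score - lag(Score))     # NA for the first row of each group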
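As a check, here is a minimal sketch that applies the same pipeline to the eight rows from the edit in the question. The names dat and result, the Diff column name, and the final filter/arrange steps are my own additions to match the layout of the desired output; they are not part of the original code.

```r
library(dplyr)

# The eight rows from the question's edit
dat <- data.frame(
  Year  = rep(c(1990, 1991), each = 4),
  State = "AL",
  Tier  = rep(c(1, 2, 1, 2), times = 2),
  Group = rep(c("A", "A", "B", "B"), times = 2),
  Score = c(75, 100, 5, 10, 95, 80, 5, 15)
)

result <- dat %>%
  arrange(Year) %>%
  group_by(State, Tier, Group) %>%
  mutate(Diff = Score - lag(Score)) %>%   # previous row within the same group
  ungroup() %>%
  filter(!is.na(Diff)) %>%                # drop each group's first year (no predecessor)
  arrange(Tier, Group)                    # same row order as the desired output

result
```

The Diff column should come out as 20, 0, -20, 5, which matches the desired result in the question.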

With your current data, the ScoreDiff column has a lot of NAs, because with only 10 rows the same combination of State, Tier, and Group rarely appears more than once. You can try it on a larger sample (I've also changed the starting year from 1990 to 1890):

n <- 100  # number of simulated rows

dat2 <- data.frame(
  Year = sample(1890:1996, n, replace = TRUE),
  State = sample(c("AL", "CA", "NY"), n, replace = TRUE),
  Tier = sample(1:2, n, replace = TRUE),
  Group = sample(c("A", "B"), n, replace = TRUE),
  Score = rnorm(n))

dat2 %>%
  arrange(Year) %>%
  group_by(State, Tier, Group) %>%
  mutate(ScoreDiff = Score - lag(Score))
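One caveat: lag() takes the previous row in the group, not necessarily the previous year, so with gaps in the data the difference can span several years. If you only want true year-on-year differences, a possible sketch (assuming one row per Year within each group) is to keep the difference only when the previous row is exactly one year earlier. The data frame gappy and the PrevYear column are hypothetical names used for illustration.

```r
library(dplyr)

# Hypothetical data: the 1992 row is missing for this group
gappy <- data.frame(
  Year  = c(1990, 1991, 1993),
  State = "AL",
  Tier  = 1,
  Group = "A",
  Score = c(75, 95, 60)
)

yearly <- gappy %>%
  arrange(Year) %>%
  group_by(State, Tier, Group) %>%
  mutate(
    PrevYear  = lag(Year),
    # keep the difference only when the previous row is the preceding year
    ScoreDiff = if_else(Year - PrevYear == 1, Score - lag(Score), NA_real_)
  ) %>%
  ungroup()

yearly
```

Here ScoreDiff is NA for 1990 (no earlier row) and for 1993 (the 1992 row is missing), and 20 for 1991.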

Upvotes: 4
