jacoby

Reputation: 31

R: How to lag a dataframe by groups

I have the following data set:

Name  Year  VarA  VarB Data.1  Data.2
A     2016  L     H    100     101
A     2017  L     H    105     99
A     2018  L     H    103     105
A     2016  L     A    90      95
A     2017  L     A    99      92
A     2018  L     A    102     101

I want to add lagged variables within each grouping of Name, VarA, and VarB, so that my data would look like this:

Name  Year  VarA  VarB Data.1  Data.2  Lg1.Data.1 Lg2.Data.1
A     2016  L     H    100     101     NA        NA
A     2017  L     H    105     99      100       NA
A     2018  L     H    103     105     105       100
A     2016  L     A    90      95      NA        NA
A     2017  L     A    99      92      90        NA
A     2018  L     A    102     101     99        90

I found the following link, which is helpful: debugging: function to create multiple lags for multiple columns (dplyr)

And am using the following code:

library(dplyr)
library(data.table)  # shift() comes from data.table

df <- df %>%
  group_by(Name) %>%
  arrange(Name, VarA, VarB, Year) %>%
  do(data.frame(., setNames(shift(.[,c(5:6)], 1:2), c(seq(1:8)))))

However, the lag is offsetting across all of the data associated with each Name, instead of within the grouping I want, so only the 2018 rows are accurately lagged:

Name  Year  VarA  VarB Data.1  Data.2  Lg1.Data.1 Lg2.Data.1
A     2016  L     H    100     101     NA        NA
A     2017  L     H    105     99      100       NA
A     2018  L     H    103     105     105       100
A     2016  L     A    90      95      103       105
A     2017  L     A    99      92      90        103
A     2018  L     A    102     101     99        90

How do I get the lag to reset for each new grouping combination (e.g. Name / VarA / VarB)?

Upvotes: 2

Views: 5989

Answers (1)

camille

Reputation: 16832

dplyr::lag lets you set the distance you want to lag by. You can group by whatever variables you want—in this case, Name, VarA, and VarB—before making your lagged variables.

library(dplyr)

df %>%
  group_by(Name, VarA, VarB) %>%
  mutate(Lg1.Data.1 = lag(Data.1, n = 1), Lg2.Data.1 = lag(Data.1, n = 2))
#> # A tibble: 6 x 8
#> # Groups:   Name, VarA, VarB [2]
#>   Name   Year VarA  VarB  Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
#>   <chr> <dbl> <chr> <chr>  <dbl>  <dbl>      <dbl>      <dbl>
#> 1 A      2016 L     H        100    101         NA         NA
#> 2 A      2017 L     H        105     99        100         NA
#> 3 A      2018 L     H        103    105        105        100
#> 4 A      2016 L     A         90     95         NA         NA
#> 5 A      2017 L     A         99     92         90         NA
#> 6 A      2018 L     A        102    101         99         90
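
For reference, df above is the question's data; a minimal sketch for rebuilding it (the column types are my assumption, chosen to match the printed tibble):

library(dplyr)

# Hypothetical reconstruction of the question's input data
df <- tibble(
  Name = "A",
  Year = rep(c(2016, 2017, 2018), 2),
  VarA = "L",
  VarB = rep(c("H", "A"), each = 3),
  Data.1 = c(100, 105, 103, 90, 99, 102),
  Data.2 = c(101, 99, 105, 95, 92, 101)
)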

If you want a version that scales to more lags, you can use some non-standard evaluation to create new lagged columns dynamically. I'll do this with purrr::map to iterate over a set of n values to lag by, make a list of data frames with the new columns added, then join all the data frames together. There are probably better NSE ways to do this, so hopefully someone can improve upon it.

I'm making up some new data, just to have a wider range of years to illustrate. Inside mutate, you can create column names with quo_name.

library(dplyr)
library(purrr)

set.seed(127)
df <- tibble(
  Name = "A", Year = rep(2016:2020, 2), VarA = "L", VarB = rep(c("H", "A"), each = 5),
  Data.1 = sample(1:10, 10, replace = T), Data.2 = sample(1:10, 10, replace = T) 
)

df_list <- purrr::map(1:4, function(i) {
  df %>%
    group_by(Name, VarA, VarB) %>%
    mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
})

You don't need to save this list—I'm just doing it to show an example of one of the data frames. You could instead go straight into reduce.

df_list[[3]]
#> # A tibble: 10 x 7
#> # Groups:   Name, VarA, VarB [2]
#>    Name   Year VarA  VarB  Data.1 Data.2  Lag3
#>    <chr> <int> <chr> <chr>  <int>  <int> <int>
#>  1 A      2016 L     H          3      9    NA
#>  2 A      2017 L     H          1      4    NA
#>  3 A      2018 L     H          3      8    NA
#>  4 A      2019 L     H          2      2     3
#>  5 A      2020 L     H          4      5     1
#>  6 A      2016 L     A          8      4    NA
#>  7 A      2017 L     A          6      8    NA
#>  8 A      2018 L     A          3      2    NA
#>  9 A      2019 L     A          8      6     8
#> 10 A      2020 L     A          9      1     6

Then use purrr::reduce to join all the data frames in the list. Since there are columns that are the same in each of the data frames, and those are the ones you want to join by, you can get away with not specifying join-by columns in inner_join.

reduce(df_list, inner_join)
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> # A tibble: 10 x 10
#> # Groups:   Name, VarA, VarB [?]
#>    Name   Year VarA  VarB  Data.1 Data.2  Lag1  Lag2  Lag3  Lag4
#>    <chr> <int> <chr> <chr>  <int>  <int> <int> <int> <int> <int>
#>  1 A      2016 L     H          3      9    NA    NA    NA    NA
#>  2 A      2017 L     H          1      4     3    NA    NA    NA
#>  3 A      2018 L     H          3      8     1     3    NA    NA
#>  4 A      2019 L     H          2      2     3     1     3    NA
#>  5 A      2020 L     H          4      5     2     3     1     3
#>  6 A      2016 L     A          8      4    NA    NA    NA    NA
#>  7 A      2017 L     A          6      8     8    NA    NA    NA
#>  8 A      2018 L     A          3      2     6     8    NA    NA
#>  9 A      2019 L     A          8      6     3     6     8    NA
#> 10 A      2020 L     A          9      1     8     3     6     8

Created on 2018-12-07 by the reprex package (v0.2.1)
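
If you'd rather skip saving the intermediate list, the same map-and-reduce idea can be piped straight into reduce, as mentioned above. A sketch under the same setup (the join-by columns are spelled out here only to silence the joining messages):

library(dplyr)
library(purrr)

# Same approach as above, but without keeping df_list around
df_lagged <- map(1:4, function(i) {
  df %>%
    group_by(Name, VarA, VarB) %>%
    mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
}) %>%
  reduce(inner_join, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2"))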

Upvotes: 2
