rjen

Reputation: 1972

Using (.) in case_when() as part of mutate() on grouped tibble

I have the following kind of data:

library(tidyverse)
library(lubridate)

data <- tibble(a = c(1, 1, 2, 3, 3),
               b = c('x', 'y', 'z', 'z', 'z'),
               c = c('ps', 'ps', 'qs', 'rs', 'rs'),
               d = c(100, 200, 300, 400, 500),
               strt = ymd(c('2019-03-20', '2020-01-01', '2018-01-02', '2020-05-01', '2016-01-01')),
               fnsh = ymd(c(NA, NA, NA, '2020-06-01', '2016-05-01')))

The operation has to apply to the data as grouped by a, b, c (i.e. data %>% group_by(a, b, c)). I want to add a column that shows whether or not a group has a start within the latest year. To have a start within the latest year, a group has to:

1) Have a row with strt within the latest year

2) Not have a row with strt before the latest year and fnsh as NA (no disqualifying overlap)

3) Not have a row with strt before the latest year and fnsh equal to or later than the latest of all entries in strt (no disqualifying overlap)

I am thus trying to get:

tibble(a = c(1, 1, 2, 3, 3),
       b = c('x', 'y', 'z', 'z', 'z'),
       c = c('ps', 'ps', 'qs', 'rs', 'rs'),
       d = c(100, 200, 300, 400, 500),
       strt = ymd(c('2019-03-20', '2020-01-01', '2018-01-02', '2020-05-01', '2016-01-01')),
       fnsh = ymd(c(NA, NA, NA, '2020-06-01', '2016-05-01')),
       startLatestYear = c(0, 1, 0, 1, 1))

My current approach is:

test <- data %>%
  group_by(a, b, c) %>%
  mutate(startLatestYear = case_when(all(is.na(fnsh)) &
                                     min(strt) > today(tzone = 'CET') - years(1) &
                                     min(strt) <= today(tzone = 'CET') ~ 1,
                                     strt > today(tzone = 'CET') - years(1) &
                                     strt <= today(tzone = 'CET') &
                                     nrow(filter(., strt < today(tzone = 'CET') - years(1) &
                                                    fnsh %in% NA)) == 0 &
                                     nrow(filter(., strt < today(tzone = 'CET') - years(1))) > 0 &
                                     strt > max(pull(filter(., strt < today(tzone = 'CET') - years(1)), fnsh)) ~ 1,
                                     TRUE ~ 0))

The first condition in my case_when() seems to work, but the second does not. I suspect that my use of . is wrong. How can I get the desired output?

Upvotes: 3

Views: 206

Answers (1)

Hong Ooi

Reputation: 57686

. is a facility provided by the magrittr package, where it refers to the left-hand side of the %>% operator. %>% knows nothing about dplyr verbs, so when you use . inside the mutate, it simply expands to the object that was piped in. In the case of a grouped df, that means the entire df, not the grouped subsets.
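A minimal sketch of that behaviour, using a made-up three-row tibble (df, g, x and the new column names are illustrative only, not from the question). Inside a grouped mutate(), nrow(.) reports the size of the whole piped-in tibble, while n() is evaluated per group:

```r
library(dplyr)

# Toy data: group "a" has two rows, group "b" has one
df <- tibble(g = c("a", "a", "b"), x = 1:3)

res <- df %>%
  group_by(g) %>%
  mutate(rows_in_dot   = nrow(.),  # `.` is the entire tibble that was piped in
         rows_in_group = n()) %>%  # n() is the size of the current group
  ungroup()

res
```

Every row gets the same rows_in_dot, whereas rows_in_group differs between the groups.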

The best solution I've found so far is to replace the mutate with a group_modify:

data %>%
    group_by(a, b, c) %>%
    group_modify(function(.x, .y)
    {
        .x %>% mutate(startLatestYear=case_when(...))
    })

This works because now the pipeline inside group_modify is executed separately for each group.
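As a rough illustration on toy data (df, g, x and n_small are hypothetical names, and the condition x < 3 merely stands in for the date comparisons in the question), a nested filter(., ...) inside group_modify sees only the rows of the current group:

```r
library(dplyr)

# Toy data: only group "a" contains a row with x < 3
df <- tibble(g = c("a", "a", "b"), x = c(1, 5, 5))

per_group <- df %>%
  group_by(g) %>%
  group_modify(function(.x, .y) {
    # In this inner pipeline `.` is bound to .x, the rows of one group
    .x %>% mutate(n_small = nrow(filter(., x < 3)))
  }) %>%
  ungroup()

per_group
```

With a plain grouped mutate() the same expression would count matching rows across the whole tibble, so group "b" would also get a non-zero n_small.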

Upvotes: 1
