rjen
rjen

Reputation: 1972

%within% in case_when() under group_modify() not working

I have the following kind of data:

library(tidyverse)
library(lubridate)


data <- tibble(a = c(1, 1, 2, 3, 3, 3, 3),
               b = c('x', 'y', 'z', 'z', 'z', 'z', 'z'),
               c = c('ps', 'ps', 'qs', 'rs', 'rs', 'rs', 'rs'),
               d = c(100, 200, 300, 400, 500, 600, 700),
               strt = ymd(c('2019-03-20', '2020-01-01', '2018-01-02', '2020-05-01', '2016-01-01', '2020-03-01', '2020-01-01')),
               fnsh = ymd(c('3019-03-20', '3020-01-01', '3018-01-02', '2020-06-01', '2016-05-01', '2020-04-01', '2020-06-10')))

I am doing a group-wise operation based on the variables a, b and c (i.e. data %>% group_by(a, b, c)) using group_modify(). For each group, I need to find the rows with genuine starting dates within the last year. A strt is genuine if it doesn't fall between the strt and fnsh of any other row in the group. My current approach is:

test <- data %>%
  group_by(a, b, c) %>%
  group_modify(function(.x, .y) {
               .x %>%
               mutate(startLatestYear = case_when(strt > today(tzone = 'CET') - years(1) &
                                                  strt <= today(tzone = 'CET') &
                                                  !strt %within% (.x %>%
                                                                  mutate(pushInterval = interval(strt + days(1), fnsh)) %>%
                                                                  select(pushInterval)) ~ 1,
                                                  TRUE ~ 0))}) %>%
  ungroup()

This approach gives:

data <- tibble(a = c(1, 1, 2, 3, 3, 3, 3),
               b = c('x', 'y', 'z', 'z', 'z', 'z', 'z'),
               c = c('ps', 'ps', 'qs', 'rs', 'rs', 'rs', 'rs'),
               d = c(100, 200, 300, 400, 500, 600, 700),
               strt = ymd(c('2019-03-20', '2020-01-01', '2018-01-02', '2020-05-01', '2016-01-01', '2020-03-01', '2020-01-01')),
               fnsh = ymd(c('3019-03-20', '3020-01-01', '3018-01-02', '2020-06-01', '2016-05-01', '2020-04-01', '2020-06-10')),
               startLatestYear = c(0, 1, 0, 1, 0, 1, 1))

What is needed is:

data <- tibble(a = c(1, 1, 2, 3, 3, 3, 3),
               b = c('x', 'y', 'z', 'z', 'z', 'z', 'z'),
               c = c('ps', 'ps', 'qs', 'rs', 'rs', 'rs', 'rs'),
               d = c(100, 200, 300, 400, 500, 600, 700),
               strt = ymd(c('2019-03-20', '2020-01-01', '2018-01-02', '2020-05-01', '2016-01-01', '2020-03-01', '2020-01-01')),
               fnsh = ymd(c('3019-03-20', '3020-01-01', '3018-01-02', '2020-06-01', '2016-05-01', '2020-04-01', '2020-06-10')),
               startLatestYear = c(0, 1, 0, 0, 0, 0, 1))

The group based on a == 3, b == 'z' and c == 'rs' has a row (the very last row) that should be the only row in the group with 1 in startLatestYear. The very last row is the only row in the group which has strt within the latest year and strt outside the intervals from the other rows in the group.

The first two conditions in the present use of case_when() seem to work. The third condition using %within% does not seem to work. How can the condition using %within% come to work? Or how can an alternative solution be implemented?

PS: I have tried making pushInterval before grouping the tibble. Doing so produces the same column for startLatestYear, but the operation leads to the 'problem' of bind_rows_() stripping away the interval attributes. Hence the current solution that produces pushInterval on the fly.

Upvotes: 1

Views: 157

Answers (1)

r2evans
r2evans

Reputation: 160407

I don't think you need to use group_modify, this works in a simple group mutate:

data %>%
  group_by(a, b, c) %>%
  mutate(x = +(purrr::map_lgl(strt, ~ sum(strt <= .x & .x <= fnsh) < 2) &
                 difftime(Sys.time(), strt, "days") < 365)) %>%
  ungroup()
# # A tibble: 7 x 7
#       a b     c         d strt       fnsh           x
#   <dbl> <chr> <chr> <dbl> <date>     <date>     <int>
# 1     1 x     ps      100 2019-03-20 3019-03-20     0
# 2     1 y     ps      200 2020-01-01 3020-01-01     1
# 3     2 z     qs      300 2018-01-02 3018-01-02     0
# 4     3 z     rs      400 2020-05-01 2020-06-01     0
# 5     3 z     rs      500 2016-01-01 2016-05-01     0
# 6     3 z     rs      600 2020-03-01 2020-04-01     0
# 7     3 z     rs      700 2020-01-01 2020-06-10     1

.x is the placeholder for the parameter passed as the first argument to map_lgl. In this case, it's also strt, but let's forget about that for a moment.

Inside of the tilde-function, strt refers to the whole vector, and .x is referring to each individual strt value (it is always length 1). strt <= .x the first time is effectively strt <= strt[1]. The sum just counts how many of the occurrences are true. (There should always be one, since a number will always be within its own range.)

Upvotes: 1

Related Questions