logical comparison with any on dplyr filtered data

Question

The logical comparison with function any not return all combinations

I would like to filter to students who improved from "C" to "A" with no other grades in between. I would like to keep the list open as to include additional values in the list. e.g. "C" and "D" to "A". The dates may not always be sequential and sometimes repeat.

My second question is why Bob is missing from the filter.

library(dplyr)
library(lubridate)
name <- c(rep("adam",5), rep("bob",5), rep("charlie",5), rep("dave",5))
date <- lubridate::date(rep(seq.Date(as.Date("2015-01-01"),as.Date("2019-01-01"),"years"),4))
score <- c("C","B","A","C","A",
       "A","B","A","B","C",
       "C","A","B","A","C",
       "B","A","A","A","C")
df <- data.frame(name,date,score)

df %>%
  group_by(name) %>%
  filter(any(date[score %in% c("A")] > date[score %in% c("C")]))%>%
  filter(!any(date[score %in% c("B")] > date[score %in% c("C")] &
              date[score %in% c("B")] < date[score %in% c("A")]))

# Charlie...

df %>%
  group_by(name) %>%
  filter(any(date[score %in% c("B")] < date[score %in% c("A")])) %>%
  summarize
# 1 adam   
# 2 charlie
# 3 dave

The first block should have included Adam. There are score "C" and "A" in sequence.

Bob is missing from the second block.

Gregor Thomas · Accepted Answer

Your issue is that < is does pair-wise comparisons. Using your second example:

df %>%
  group_by(name) %>%
  filter(any(date[score %in% c("B")] < date[score %in% c("A")])) %>%
  summarize

## let's look at bob
# 6      bob 2015-01-01     A
# 7      bob 2016-01-01     B
# 8      bob 2017-01-01     A
# 9      bob 2018-01-01     B
# 10     bob 2019-01-01     C

## call this X
date[score %in% c("B")] # corresponds to two dates:
# 2016-01-01
# 2018-01-01

## and this Y
date[score %in% c("A")] # also two dates
# 2015-01-01
# 2017-01-01

X < Y ## returns FALSE FALSE
# because X[1] > Y[1] and X[2] > Y[2]

You could work around this, by doing something like:

df %>%
  group_by(name) %>%
  # is the first B before the last A
  filter(min(date[score %in% "B"]) < max(date[score %in% "A"])) %>%
  summarize
#   name   
#     
# 1 adam   
# 2 bob    
# 3 charlie
# 4 dave

For the consecutive changes, I would suggest using lead or lag. If the dates are not in sequence, use arrange first to put them in sequence:

df %>% group_by(name) %>%
  filter(any(score %in% c("A") & lead(score) %in% c("C"))) %>%
  summarize
#   name   
#     
# 1 adam   
# 2 charlie
# 3 dave

But this won't work if the dates repeat. The easiest way I can think of to account for repeating dates is to collapse them into one row, something like

df %>% group_by(name, date) %>%
  summarize(scores = paste(score, collapse = "")) %>%
  group_by(name) %>%
  filter(grepl("A", scores) & lead(grepl("C", scores)) %>% 
  summarize
# # A tibble: 3 x 1
#   name   
#     
# 1 adam   
# 2 charlie
# 3 dave

I didn't test on on data with multiple dates, but it should work.

logical comparison with any on dplyr filtered data

Answers (2)

Related Questions