Joseph Fann
Joseph Fann

Reputation: 43

logical comparison with any on dplyr filtered data

The logical comparison with function any not return all combinations

I would like to filter to students who improved from "C" to "A" with no other grades in between. I would like to keep the list open as to include additional values in the list. e.g. "C" and "D" to "A". The dates may not always be sequential and sometimes repeat.

My second question is why Bob is missing from the filter.

library(dplyr)
library(lubridate)
name <- c(rep("adam",5), rep("bob",5), rep("charlie",5), rep("dave",5))
date <- lubridate::date(rep(seq.Date(as.Date("2015-01-01"),as.Date("2019-01-01"),"years"),4))
score <- c("C","B","A","C","A",
       "A","B","A","B","C",
       "C","A","B","A","C",
       "B","A","A","A","C")
df <- data.frame(name,date,score)

df %>%
  group_by(name) %>%
  filter(any(date[score %in% c("A")] > date[score %in% c("C")]))%>%
  filter(!any(date[score %in% c("B")] > date[score %in% c("C")] &
              date[score %in% c("B")] < date[score %in% c("A")]))

# Charlie...

df %>%
  group_by(name) %>%
  filter(any(date[score %in% c("B")] < date[score %in% c("A")])) %>%
  summarize
# 1 adam   
# 2 charlie
# 3 dave   

The first block should have included Adam. There are score "C" and "A" in sequence.

Bob is missing from the second block.

Upvotes: 0

Views: 92

Answers (2)

Joseph Fann
Joseph Fann

Reputation: 43

Thanks to @Gregor

The lead and lag are exactly what I was looking for. I confirm the lead and lag does not work if the dates repeat, and I need to arrange by date first. grepl was able to solve the issue of repeating dates. Now I have to figure out how to uncollapse the scores. Here are the codes for those interested

name <- c(rep("adam",3), rep("bob",3), rep("charlie",3), rep("dave",3))
date <- lubridate::date(rep(c(as.Date("2014-01-01"),as.Date("2015-01-01"),as.Date("2015-01-01")),4))
score <- c("C","B","A",
           "C","A","B",
           "C","A","A",
           "C","B","B")
df <- data.frame(name,date,score)

df %>% group_by(name) %>%
  filter(any(score %in% c("C") & lead(score %in% c("A")))) %>%
  summarize

# A tibble: 2 x 1
#  name   
#  <fct>  
# 1 bob    
# 2 charlie

df %>% group_by(name, date) %>%
  summarize(scores = paste(score, collapse = "")) %>%
  group_by(name) %>%
  filter(any(grepl("C", scores) & lead(grepl("A", scores))))

# # A tibble: 6 x 3
# # Groups:   name [3]
# name    date       scores
# <fct>   <date>     <chr> 
#   1 adam    2014-01-01 C     
# 2 adam    2015-01-01 BA    
# 3 bob     2014-01-01 C     
# 4 bob     2015-01-01 AB    
# 5 charlie 2014-01-01 C     
# 6 charlie 2015-01-01 AA  

Regarding second question on pair-wise comparison.

I came to the same conclusion that > does not do the pair-wise comparison. This questions came into light as I was researching why any function was not working as expected.

1:3 > 2:4
FALSE FALSE FALSE

I think expand.grid can help but I'll need to learn more about functional programming, purrr package and learn how to tie them together.

filter(min(date[score %in% "B"]) < max(date[score %in% "A"]))
works perfectly for my purpose.

Upvotes: 0

Gregor Thomas
Gregor Thomas

Reputation: 145965

Your issue is that < is does pair-wise comparisons. Using your second example:

df %>%
  group_by(name) %>%
  filter(any(date[score %in% c("B")] < date[score %in% c("A")])) %>%
  summarize

## let's look at bob
# 6      bob 2015-01-01     A
# 7      bob 2016-01-01     B
# 8      bob 2017-01-01     A
# 9      bob 2018-01-01     B
# 10     bob 2019-01-01     C

## call this X
date[score %in% c("B")] # corresponds to two dates:
# 2016-01-01
# 2018-01-01

## and this Y
date[score %in% c("A")] # also two dates
# 2015-01-01
# 2017-01-01

X < Y ## returns FALSE FALSE
# because X[1] > Y[1] and X[2] > Y[2]

You could work around this, by doing something like:

df %>%
  group_by(name) %>%
  # is the first B before the last A
  filter(min(date[score %in% "B"]) < max(date[score %in% "A"])) %>%
  summarize
#   name   
#   <fct>  
# 1 adam   
# 2 bob    
# 3 charlie
# 4 dave   

For the consecutive changes, I would suggest using lead or lag. If the dates are not in sequence, use arrange first to put them in sequence:

df %>% group_by(name) %>%
  filter(any(score %in% c("A") & lead(score) %in% c("C"))) %>%
  summarize
#   name   
#   <fct>  
# 1 adam   
# 2 charlie
# 3 dave

But this won't work if the dates repeat. The easiest way I can think of to account for repeating dates is to collapse them into one row, something like

df %>% group_by(name, date) %>%
  summarize(scores = paste(score, collapse = "")) %>%
  group_by(name) %>%
  filter(grepl("A", scores) & lead(grepl("C", scores)) %>% 
  summarize
# # A tibble: 3 x 1
#   name   
#   <fct>  
# 1 adam   
# 2 charlie
# 3 dave  

I didn't test on on data with multiple dates, but it should work.

Upvotes: 1

Related Questions