Reputation: 43
The logical comparison with function any not return all combinations
I would like to filter to students who improved from "C" to "A" with no other grades in between. I would like to keep the list open as to include additional values in the list. e.g. "C" and "D" to "A". The dates may not always be sequential and sometimes repeat.
My second question is why Bob is missing from the filter.
library(dplyr)
library(lubridate)
name <- c(rep("adam",5), rep("bob",5), rep("charlie",5), rep("dave",5))
date <- lubridate::date(rep(seq.Date(as.Date("2015-01-01"),as.Date("2019-01-01"),"years"),4))
score <- c("C","B","A","C","A",
"A","B","A","B","C",
"C","A","B","A","C",
"B","A","A","A","C")
df <- data.frame(name,date,score)
df %>%
group_by(name) %>%
filter(any(date[score %in% c("A")] > date[score %in% c("C")]))%>%
filter(!any(date[score %in% c("B")] > date[score %in% c("C")] &
date[score %in% c("B")] < date[score %in% c("A")]))
# Charlie...
df %>%
group_by(name) %>%
filter(any(date[score %in% c("B")] < date[score %in% c("A")])) %>%
summarize
# 1 adam
# 2 charlie
# 3 dave
The first block should have included Adam. There are score "C" and "A" in sequence.
Bob is missing from the second block.
Upvotes: 0
Views: 92
Reputation: 43
Thanks to @Gregor
The lead
and lag
are exactly what I was looking for.
I confirm the lead
and lag
does not work if the dates repeat, and I need to arrange by date first.
grepl
was able to solve the issue of repeating dates.
Now I have to figure out how to uncollapse the scores.
Here are the codes for those interested
name <- c(rep("adam",3), rep("bob",3), rep("charlie",3), rep("dave",3))
date <- lubridate::date(rep(c(as.Date("2014-01-01"),as.Date("2015-01-01"),as.Date("2015-01-01")),4))
score <- c("C","B","A",
"C","A","B",
"C","A","A",
"C","B","B")
df <- data.frame(name,date,score)
df %>% group_by(name) %>%
filter(any(score %in% c("C") & lead(score %in% c("A")))) %>%
summarize
# A tibble: 2 x 1
# name
# <fct>
# 1 bob
# 2 charlie
df %>% group_by(name, date) %>%
summarize(scores = paste(score, collapse = "")) %>%
group_by(name) %>%
filter(any(grepl("C", scores) & lead(grepl("A", scores))))
# # A tibble: 6 x 3
# # Groups: name [3]
# name date scores
# <fct> <date> <chr>
# 1 adam 2014-01-01 C
# 2 adam 2015-01-01 BA
# 3 bob 2014-01-01 C
# 4 bob 2015-01-01 AB
# 5 charlie 2014-01-01 C
# 6 charlie 2015-01-01 AA
Regarding second question on pair-wise comparison.
I came to the same conclusion that >
does not do the pair-wise comparison.
This questions came into light as I was researching why any
function was not working as expected.
1:3 > 2:4
FALSE FALSE FALSE
I think expand.grid
can help but I'll need to learn more about functional programming, purrr package and learn how to tie them together.
filter(min(date[score %in% "B"]) < max(date[score %in% "A"]))
works perfectly for my purpose.
Upvotes: 0
Reputation: 145965
Your issue is that <
is does pair-wise comparisons. Using your second example:
df %>%
group_by(name) %>%
filter(any(date[score %in% c("B")] < date[score %in% c("A")])) %>%
summarize
## let's look at bob
# 6 bob 2015-01-01 A
# 7 bob 2016-01-01 B
# 8 bob 2017-01-01 A
# 9 bob 2018-01-01 B
# 10 bob 2019-01-01 C
## call this X
date[score %in% c("B")] # corresponds to two dates:
# 2016-01-01
# 2018-01-01
## and this Y
date[score %in% c("A")] # also two dates
# 2015-01-01
# 2017-01-01
X < Y ## returns FALSE FALSE
# because X[1] > Y[1] and X[2] > Y[2]
You could work around this, by doing something like:
df %>%
group_by(name) %>%
# is the first B before the last A
filter(min(date[score %in% "B"]) < max(date[score %in% "A"])) %>%
summarize
# name
# <fct>
# 1 adam
# 2 bob
# 3 charlie
# 4 dave
For the consecutive changes, I would suggest using lead
or lag
. If the dates are not in sequence, use arrange
first to put them in sequence:
df %>% group_by(name) %>%
filter(any(score %in% c("A") & lead(score) %in% c("C"))) %>%
summarize
# name
# <fct>
# 1 adam
# 2 charlie
# 3 dave
But this won't work if the dates repeat. The easiest way I can think of to account for repeating dates is to collapse them into one row, something like
df %>% group_by(name, date) %>%
summarize(scores = paste(score, collapse = "")) %>%
group_by(name) %>%
filter(grepl("A", scores) & lead(grepl("C", scores)) %>%
summarize
# # A tibble: 3 x 1
# name
# <fct>
# 1 adam
# 2 charlie
# 3 dave
I didn't test on on data with multiple dates, but it should work.
Upvotes: 1