Reputation: 129
This is an extension of the R problem I asked earlier: How to select rows with certain values within a group in R
I got great help on that issue, but it got a bit more complicated now and I hope to receive advices how to handle this.
My Data looks like this:
dd <- read.table(text="
event.timeline.ys ID year group outcome
1 2 800033 2008 A 3
2 1 800033 2009 A 3
3 0 800033 2010 A NA
4 -1 800033 2011 A 2
5 -2 800033 2012 A 1
15 0 800076 2008 B 2
16 -1 800076 2009 B NA
17 5 800100 2014 C 4
18 4 800100 2015 C 4
19 2 800100 2017 C 4
20 1 800100 2018 C 3
30 0 800125 2008 A 2
31 -1 800125 2009 A 1
32 -2 800125 2010 A NA
33 2 800031 2008 A 3
34 1 800031 2009 A 3
35 0 800031 2010 A NA
36 -1 800031 2011 A NA
37 -2 800031 2012 A 1", header=TRUE)
I would like to select only special rows within a group (ID). These rows should be selected according to the following procedure:
If possible I would like to keep the last row with a positive value on event.timeline.ys for each participant (i.e., last row within an ID-group with event.timeline.ys >= 0) in which the outcome variable is not NA but has a valid value (e.g., for ID == 800033 this would be row 2).
Additionally, I would like to keep the first row with a negative value on event.timeline.ys for each participant (i.e., first row within an ID-group with event.timeline.ys < 0) in which the outcome variable is not NA (e.g., for ID == 800033 this would be row 4).
In the special case of ID == 800076 that does not have any non-NA values on the outcome variable when event.timeline.ys < 0, I would still like to keep the first row in which event.timeline.ys < 0.
The person with the ID = 800100 does not have any negative values on event.timeline.ys. In this case, I would like to keep only the last row with event.timeline.ys >= 0.
All other rows should be deleted. The final data frame would look like this:
event.timeline.ys ID year group outcome
2 1 800033 2009 A 3
4 -1 800033 2011 A 2
15 0 800076 2008 B 2
16 -1 800076 2009 B NA
20 1 800100 2018 C 3
30 0 800125 2008 A 2
31 -1 800125 2009 A 1
34 1 800031 2009 A 3
37 -2 800031 2012 A 1
I very much appreciate advices on how to solve this problem. I already tried this:
dd %>%
group_by(ID) %>%
filter(row_number() == last(which(event.timeline.ys >= 0 & outcome >= 0)) |
row_number() == first(which(event.timeline.ys < 0 & outcome >= 0)))
However, I then lose the row 16 (for ID == 800076) which is unfortunate.
Many thanks in advance!
Upvotes: 3
Views: 208
Reputation: 2022
Just to keep in-line with my previous answer using data.table
we can use ifelse
condition to select rows
library(data.table)
setDT(dd)
dd[, .SD[na.omit(c(ifelse(any(event.timeline.ys >= 0 & !is.na(outcome)),
last(which(event.timeline.ys >= 0 & !is.na(outcome))),
last(which(event.timeline.ys >= 0))),
ifelse(any(event.timeline.ys < 0 & !is.na(outcome)),
first(which(event.timeline.ys < 0 & !is.na(outcome))),
first(which(event.timeline.ys < 0)))))],
by=ID]
ID event.timeline.ys year group outcome
1: 800033 1 2009 A 3
2: 800033 -1 2011 A 2
3: 800076 0 2008 B 2
4: 800076 -1 2009 B NA
5: 800100 1 2018 C 3
6: 800125 0 2008 A 2
7: 800125 -1 2009 A 1
8: 800031 1 2009 A 3
9: 800031 -2 2012 A 1
Upvotes: 1
Reputation: 2399
Here's the solution using dplyr
and wrapr
's pipe %.>%
.
I'm adding outcome_na
and arranging by it to meet a condition "does not have any non-NA values".
library(dplyr)
library(wrapr)
dd %>%
group_by(ID) %>%
mutate(outcome_na = !is.na(outcome)) %.>%
bind_rows(
filter(., event.timeline.ys >= 0) %>% arrange(outcome_na, year) %>% slice(n()),
filter(., event.timeline.ys < 0) %>% arrange(desc(outcome_na), year) %>% slice(1)
) %>%
arrange(ID) %>%
select(-outcome_na)
Upvotes: 1
Reputation: 690
Using dplyr
:
dd %>%
group_by(ID, event.timeline.ys>=0) %>%
arrange(ID, event.timeline.ys>=0, abs(event.timeline.ys)) %>%
filter(!is.na(outcome) | n()==1) %>%
filter(row_number()==1) %>%
ungroup() %>%
select(-one_of('event.timeline.ys >= 0'))
Output:
event.timeline.ys ID year group outcome
<int> <int> <int> <fct> <int>
1 -1 800033 2011 A 2
2 1 800033 2009 A 3
3 -1 800076 2009 B NA
4 0 800076 2008 B 2
5 1 800100 2018 C 3
6 -1 800125 2009 A 1
7 0 800125 2008 A 2
Upvotes: 1