Reputation: 129
If possible, I would like to select the last two rows of each group (ID) that have a valid value (i.e., not NA) on my outcome variable (outcome).
Sample data looks like this:
df <- read.table(text="
ID outcome
1 800033 3
2 800033 3
3 800033 NA
4 800033 2
5 800033 1
15 800076 2
16 800076 NA
17 800100 4
18 800100 4
19 800100 4
20 800100 3
30 800125 2
31 800125 1
32 800125 NA", header=TRUE)
In the case that a participant does not have two valid values on my outcome variable (e.g., ID == 800076), I would still like to keep the last two rows of this group (ID). All other rows should be deleted.
My final data set would therefore look like this:
ID outcome
4 800033 2
5 800033 1
15 800076 2
16 800076 NA
19 800100 4
20 800100 3
30 800125 2
31 800125 1
Any advice on how to do this is highly appreciated!
Upvotes: 0
Views: 143
Reputation: 887901
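We can do this with dplyr: within each ID, keep every row if the group has at most two rows, otherwise drop the rows where outcome is NA, and then take the last two rows of what remains.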
library(dplyr)
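df %>%
  group_by(ID) %>%
  # keep all rows of small groups (<= 2 rows); elsewhere drop NA outcomes
  filter(n() <= 2 | !is.na(outcome)) %>%
  # take the last two remaining rows of each group
  slice(tail(row_number(), 2))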
# A tibble: 8 x 2
# Groups: ID [4]
# ID outcome
# <int> <int>
#1 800033 2
#2 800033 1
#3 800076 2
#4 800076 NA
#5 800100 4
#6 800100 3
#7 800125 2
#8 800125 1
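One edge case worth noting: if a group has more than two rows but fewer than two valid values (say outcomes NA, NA, 3), the filter above leaves only the single valid row. If the intent is to fall back to the last two rows in that situation, a minimal sketch could look like the following. The helper name last_two_valid is made up for illustration, and the fallback rule is an assumption about what the question wants rather than something it states for this exact case.

```r
library(dplyr)

df <- read.table(text = "
ID outcome
1 800033 3
2 800033 3
3 800033 NA
4 800033 2
5 800033 1
15 800076 2
16 800076 NA
17 800100 4
18 800100 4
19 800100 4
20 800100 3
30 800125 2
31 800125 1
32 800125 NA", header = TRUE)

# Hypothetical helper: positions (within one group) of the last two
# non-NA values; if fewer than two values are valid, fall back to the
# last two positions of the group regardless of NA.
last_two_valid <- function(x) {
  valid <- which(!is.na(x))
  if (length(valid) >= 2) tail(valid, 2) else tail(seq_along(x), 2)
}

result <- df %>%
  group_by(ID) %>%
  slice(last_two_valid(outcome)) %>%
  ungroup()
```

On the sample data this should select the same eight rows as the filter version; the two only differ for groups like the NA, NA, 3 example.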
Upvotes: 0
Reputation: 389275
We can use an if condition inside slice: check whether the group has more than two rows. If it does, select the last two rows with a non-NA outcome; otherwise keep all rows of the group.
library(dplyr)
df %>%
  group_by(ID) %>%
  # more than two rows: last two non-NA rows; otherwise: all rows of the group
  slice(if (n() > 2) tail(which(!is.na(outcome)), 2) else seq_len(n()))
# ID outcome
# <int> <int>
#1 800033 2
#2 800033 1
#3 800076 2
#4 800076 NA
#5 800100 4
#6 800100 3
#7 800125 2
#8 800125 1
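For anyone who would rather avoid the dplyr dependency, the same if/else rule can be sketched in base R. This is an illustrative equivalent rather than part of the answer above; the names rows_by_id, keep and result are arbitrary.

```r
df <- read.table(text = "
ID outcome
1 800033 3
2 800033 3
3 800033 NA
4 800033 2
5 800033 1
15 800076 2
16 800076 NA
17 800100 4
18 800100 4
19 800100 4
20 800100 3
30 800125 2
31 800125 1
32 800125 NA", header = TRUE)

# Row positions of df, split into one integer vector per ID
rows_by_id <- split(seq_len(nrow(df)), df$ID)

# Per group: with more than two rows, keep the last two non-NA rows;
# otherwise keep every row of the group
keep <- unlist(lapply(rows_by_id, function(rows) {
  if (length(rows) > 2) {
    tail(rows[!is.na(df$outcome[rows])], 2)
  } else {
    rows
  }
}), use.names = FALSE)

# Sort the positions so the result keeps the original row order
result <- df[sort(keep), ]
```

Because the sample data is read with its original row names, result keeps the row labels from the question, which makes it easy to compare against the desired output.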
Upvotes: 1