Reputation: 13
I have a data frame that requires conditional recoding of a column based on the date listed in certain rows for each subset of IDs. I am trying to figure out how to best achieve this using the mutate function in dplyr. Suggestions and alternate solutions are welcome, but I would like to avoid using for loops.
I know how to write a really verbose and inefficient for loop that would solve this problem, but would like to know how to do it more efficiently.
The sample data frame:
df<-data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
date = as.Date(c("2016-02-01","2016-02-01","2016-02-01","2016-03-21", "2016-03-21", "2016-03-21", "2016-10-05", "2016-10-05", "2016-10-05", "2016-10-05", "2016-03-01","2016-03-01","2016-03-01","2016-04-21", "2016-04-21", "2016-04-21", "2016-11-05", "2016-11-05", "2016-11-05", "2016-11-05")),
trial = c(NA, NA, NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, NA)
My pseudo code - the second logical argument in the first two case_when statements is where I am stuck.
df%>%
group_by(ID)%>%
mutate(results = case_when(
is.na(trial) & date < date where trial = 1 ~ 0,
is.na(trial) & date > date where trial = 1 ~ 2,
trial == trial
))
The expected result being:
data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
date = as.Date(c("2016-02-01","2016-02-01","2016-02-01","2016-03-21", "2016-03-21", "2016-03-21", "2016-10-05", "2016-10-05", "2016-10-05", "2016-10-05", "2016-03-01","2016-03-01","2016-03-01","2016-04-21", "2016-04-21", "2016-04-21", "2016-11-05", "2016-11-05", "2016-11-05", "2016-11-05")),
trial = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2)
)
Upvotes: 1
Views: 156
Reputation: 887721
An option would be to group by 'ID' and transform the 'trial' by applying the run-length-id on (rleid
) on the 'trial' column
library(dplyr)
library(data.table)
df %>%
group_by(ID) %>%
mutate(trial = rleid(trial)-1)
# A tibble: 20 x 3
# Groups: ID [2]
# ID date trial
# <dbl> <date> <dbl>
# 1 1 2016-02-01 0
# 2 1 2016-02-01 0
# 3 1 2016-02-01 0
# 4 1 2016-03-21 1
# 5 1 2016-03-21 1
# 6 1 2016-03-21 1
# 7 1 2016-10-05 2
# 8 1 2016-10-05 2
# 9 1 2016-10-05 2
#10 1 2016-10-05 2
#11 2 2016-03-01 0
#12 2 2016-03-01 0
#13 2 2016-03-01 0
#14 2 2016-04-21 1
#15 2 2016-04-21 1
#16 2 2016-04-21 1
#17 2 2016-11-05 2
#18 2 2016-11-05 2
#19 2 2016-11-05 2
#20 2 2016-11-05 2
Or using rle
df %>%
group_by(ID) %>%
mutate(trial = with(rle(is.na(trial)),
rep(seq_along(values), lengths))-1)
Upvotes: 2
Reputation: 389215
Converting your pseudo code to code we can use which.max(trial == 1)
to get first occurrence where trial = 1
for each group. This also assumes that there would be at least one entry of 1 in trial
for each ID
.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(trial = case_when(is.na(trial) & date < date[which.max(trial == 1)] ~ 0,
is.na(trial) & date > date[which.max(trial == 1)] ~ 2,
TRUE ~ trial))
# ID date trial
# <dbl> <date> <dbl>
# 1 1 2016-02-01 0
# 2 1 2016-02-01 0
# 3 1 2016-02-01 0
# 4 1 2016-03-21 1
# 5 1 2016-03-21 1
# 6 1 2016-03-21 1
# 7 1 2016-10-05 2
# 8 1 2016-10-05 2
# 9 1 2016-10-05 2
#10 1 2016-10-05 2
#11 2 2016-03-01 0
#12 2 2016-03-01 0
#13 2 2016-03-01 0
#14 2 2016-04-21 1
#15 2 2016-04-21 1
#16 2 2016-04-21 1
#17 2 2016-11-05 2
#18 2 2016-11-05 2
#19 2 2016-11-05 2
#20 2 2016-11-05 2
Upvotes: 1