Reputation: 29
My aim is to create a new df column for which the values are based on two other columns. My data set concerns recruitment into a study. I would like a column that defines whether or not a person was in a particular round of the study, if so was it their first involvement, their second, third and so on (up to 8 rounds). Currently I am attempting this with mutate(case_when))
in dplyr and using lag()
. However, it works incorrectly if a person missed a round of the study and later came back into it. The data set looks like this:
person | round | in_round |
A 1 1
A 2 1
A 3 1
A 4 1
A 5 1
A 6 0
A 7 0
A 8 0
B 1 0
B 2 0
B 3 1
B 4 1
B 5 1
B 6 1
B 7 0
B 8 1
What I need is a separate column that uses round
and in_round
for each person to produce the following:
person | round | in_round | round_status
A 1 1 recruited
A 2 1 follow_up_1
A 3 1 follow_up_2
A 4 1 follow_up_3
A 5 1 follow_up_4
A 6 0 none
A 7 0 none
A 8 0 none
B 1 0 none
B 2 0 none
B 3 1 recruited
B 4 1 follow_up_1
B 5 1 follow_up_2
B 6 1 follow_up_3
B 7 0 none
B 8 1 follow_up_4
In summary:
in_round == 0
, round_status == "none"
in_round == 1
, round_status == "recruited"
in_round == 1
, round_status == "follow_up_X"
(dependent on the number of previous waves the individual was present in).Upvotes: 2
Views: 967
Reputation: 145745
Try this:
df %>%
group_by(person) %>%
arrange(round) %>%
mutate(cum_round = cumsum(in_round),
round_status = case_when(
in_round == 0 ~ "none",
cum_round == 1 ~ "recruited",
TRUE ~ paste0("follow_up_", cum_round - 1)
))
Upvotes: 3