Reputation: 23717
data.frame(id = rep(letters[1:2], each = 8), examID = as.character(c(11,11:17, 21:28)), baselineExam = c(rep(c(rbind('bl',c(11,14))), each = 2), c(rep('bl', 4), 21, 21, 24, 24)), examType = c("x", "x", "fu", "fu", "y", "z", "fu", "fu", "x", "x", "y", "z", "fu", "fu", "fu", "fu"), expected_col = c("x", "x", "x", "x", "y", "z", "y", "y", "x", "x", "y", "z", "x", "x", "z", "z"))
# the first four columns are my original data frame
# the last column shows the expected output.
id examID baselineExam examType expected_col
1 a 11 bl x x
2 a 11 bl x x
3 a 12 11 fu x
4 a 13 11 fu x
5 a 14 bl y y
6 a 15 bl z z
7 a 16 14 fu y
8 a 17 14 fu y
9 b 21 bl x x
10 b 22 bl x x
11 b 23 bl y y
12 b 24 bl z z
13 b 25 21 fu x
14 b 26 21 fu x
15 b 27 24 fu z
16 b 28 24 fu z
There are different types of longitudinal exams (examType
) for each subject (id
). Each exam has its unique identifier (examID
). Only the baseline exam contains the information of the Examtype.
Follow up exams only contain the information 'fu', but not the correct examType. However, the baselineExam
column shows which was the baseline exam for the respective follow up. I would like to have a column with the correct examType in each row. (see expected_col
in the data frame)
I am stuck. I can't use something like if(baselineExam %in% examID)
in combination with any
, because there is no group based on which I can separate those exams.
A way would be to get the index of the row of the respective 'fu', look up which value this is in 'baselineExam' and then look for this number in 'examID' in order to get the examType of this row.
I tried a helper column with the index number (there must be a better way too), and I can get the value for the baseline exam in this row - but I don't know how to then conditionally get the value in examType, when examID (of the baseline-row) == baselineExam (of the follow up row).
base R solutions or dplyr preferred, but open for anything
edit
I have changed the given data because my previous sample data did not fully reflect the complexity of the real data (I oversimplified it). Unfortunately, neither @www nor @akrun 's solution worked therefore - my bad, because I haven't provided a good enough sample :( There can be multiple rows per exam ID (long data, rows 1 and 2 in my example), and multiple baseline exams after another before the follow up exams were taken.
Upvotes: 1
Views: 168
Reputation: 39154
A solution using the dplyr
and the tidyr
package. the key is to replace fu
with NA
and then use the fill
function to fill in the NA
with the previous row. mutate_if
is just converting factor column to a character column. dat2
is the final output.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
mutate_if(is.factor, as.character) %>%
mutate(type = ifelse(examType %in% "fu", NA, examType)) %>%
fill(type)
dat2
# id examID baselineExam examType type
# 1 a 11 bl x x
# 2 a 12 11 fu x
# 3 a 13 bl y y
# 4 a 14 13 fu y
# 5 b 21 bl x x
# 6 b 22 21 fu x
# 7 b 23 bl z z
# 8 b 24 23 fu z
We can use the dplyr
package to achieve this. First, subset the data frame with known examType
, find the rows with the unique combination between id
, examID
, and examType
, join the table to the original data frame, and use coalesce
to combine information.
library(dplyr)
dat2 <- dat %>%
filter(!examType %in% "fu") %>%
distinct(id, examID, examType) %>%
rename(Type = examType) %>%
left_join(dat, ., by = c("id", "baselineExam" = "examID")) %>%
mutate(Type = coalesce(Type, examType))
dat2
# id examID baselineExam examType Type
# 1 a 11 bl x x
# 2 a 11 bl x x
# 3 a 12 11 fu x
# 4 a 13 11 fu x
# 5 a 14 bl y y
# 6 a 15 bl z z
# 7 a 16 14 fu y
# 8 a 17 14 fu y
# 9 b 21 bl x x
# 10 b 22 bl x x
# 11 b 23 bl y y
# 12 b 24 bl z z
# 13 b 25 21 fu x
# 14 b 26 21 fu x
# 15 b 27 24 fu z
# 16 b 28 24 fu z
dat <- data.frame(id = rep(letters[1:2], each = 8),
examID = as.character(c(11,11:17, 21:28)),
baselineExam = c(rep(c(rbind('bl',c(11,14))), each = 2), c(rep('bl', 4), 21, 21, 24, 24)),
examType = c("x", "x", "fu", "fu", "y", "z", "fu", "fu", "x", "x", "y", "z", "fu", "fu", "fu", "fu"),
stringsAsFactors = FALSE)
Upvotes: 4
Reputation: 886938
One option is to group by 'id' and create a grouping variable based on the occurrence of 'bl', create the 'type' as the 'examType' that corresponds to 'bl' in 'baselineExam'
library(dplyr)
df1 %>%
group_by(id, grp = cumsum(baselineExam == 'bl')) %>%
mutate(type = examType[baselineExam == 'bl']) %>%
ungroup %>%
select(-grp)
# A tibble: 8 x 5
# id examID baselineExam examType type
# <fct> <fct> <fct> <fct> <fct>
#1 a 11 bl x x
#2 a 12 11 fu x
#3 a 13 bl y y
#4 a 14 13 fu y
#5 b 21 bl x x
#6 b 22 21 fu x
#7 b 23 bl z z
#8 b 24 23 fu z
Upvotes: 2