Conditional statement referencing separate rows

Question

data.frame(id = rep(letters[1:2], each  = 8), examID = as.character(c(11,11:17, 21:28)), baselineExam = c(rep(c(rbind('bl',c(11,14))), each = 2), c(rep('bl', 4), 21, 21, 24, 24)), examType = c("x", "x", "fu", "fu", "y", "z", "fu", "fu", "x", "x", "y", "z", "fu", "fu", "fu", "fu"), expected_col = c("x", "x", "x", "x", "y", "z", "y", "y", "x", "x", "y", "z", "x", "x", "z", "z"))

# the first four columns are my original data frame
# the last column shows the expected output.     

    id examID baselineExam examType expected_col
1   a     11           bl        x            x
2   a     11           bl        x            x
3   a     12           11       fu            x
4   a     13           11       fu            x
5   a     14           bl        y            y
6   a     15           bl        z            z
7   a     16           14       fu            y
8   a     17           14       fu            y
9   b     21           bl        x            x
10  b     22           bl        x            x
11  b     23           bl        y            y
12  b     24           bl        z            z
13  b     25           21       fu            x
14  b     26           21       fu            x
15  b     27           24       fu            z
16  b     28           24       fu            z

There are different types of longitudinal exams (examType) for each subject (id). Each exam has its unique identifier (examID). Only the baseline exam contains the information of the Examtype. Follow up exams only contain the information 'fu', but not the correct examType. However, the baselineExam column shows which was the baseline exam for the respective follow up. I would like to have a column with the correct examType in each row. (see expected_col in the data frame)

I am stuck. I can't use something like if(baselineExam %in% examID) in combination with any, because there is no group based on which I can separate those exams.

A way would be to get the index of the row of the respective 'fu', look up which value this is in 'baselineExam' and then look for this number in 'examID' in order to get the examType of this row.

I tried a helper column with the index number (there must be a better way too), and I can get the value for the baseline exam in this row - but I don't know how to then conditionally get the value in examType, when examID (of the baseline-row) == baselineExam (of the follow up row).

base R solutions or dplyr preferred, but open for anything

edit

I have changed the given data because my previous sample data did not fully reflect the complexity of the real data (I oversimplified it). Unfortunately, neither @www nor @akrun 's solution worked therefore - my bad, because I haven't provided a good enough sample :( There can be multiple rows per exam ID (long data, rows 1 and 2 in my example), and multiple baseline exams after another before the follow up exams were taken.

www · Accepted Answer

A solution using the dplyr and the tidyr package. the key is to replace fu with NA and then use the fill function to fill in the NA with the previous row. mutate_if is just converting factor column to a character column. dat2 is the final output.

library(dplyr)
library(tidyr)

dat2 <- dat %>%
  mutate_if(is.factor, as.character) %>%
  mutate(type = ifelse(examType %in% "fu", NA, examType)) %>%
  fill(type)
dat2
#   id examID baselineExam examType type
# 1  a     11           bl        x    x
# 2  a     12           11       fu    x
# 3  a     13           bl        y    y
# 4  a     14           13       fu    y
# 5  b     21           bl        x    x
# 6  b     22           21       fu    x
# 7  b     23           bl        z    z
# 8  b     24           23       fu    z

Update

We can use the dplyr package to achieve this. First, subset the data frame with known examType, find the rows with the unique combination between id, examID, and examType, join the table to the original data frame, and use coalesce to combine information.

library(dplyr)

dat2 <- dat %>%
  filter(!examType %in% "fu") %>%
  distinct(id, examID, examType) %>%
  rename(Type = examType) %>%
  left_join(dat, ., by = c("id", "baselineExam" = "examID")) %>%
  mutate(Type = coalesce(Type, examType))

dat2
#    id examID baselineExam examType Type
# 1   a     11           bl        x    x
# 2   a     11           bl        x    x
# 3   a     12           11       fu    x
# 4   a     13           11       fu    x
# 5   a     14           bl        y    y
# 6   a     15           bl        z    z
# 7   a     16           14       fu    y
# 8   a     17           14       fu    y
# 9   b     21           bl        x    x
# 10  b     22           bl        x    x
# 11  b     23           bl        y    y
# 12  b     24           bl        z    z
# 13  b     25           21       fu    x
# 14  b     26           21       fu    x
# 15  b     27           24       fu    z
# 16  b     28           24       fu    z

Data for the updated example

dat <- data.frame(id = rep(letters[1:2], each  = 8), 
                  examID = as.character(c(11,11:17, 21:28)), 
                  baselineExam = c(rep(c(rbind('bl',c(11,14))), each = 2), c(rep('bl', 4), 21, 21, 24, 24)), 
                  examType = c("x", "x", "fu", "fu", "y", "z", "fu", "fu", "x", "x", "y", "z", "fu", "fu", "fu", "fu"),
                  stringsAsFactors = FALSE)

Conditional statement referencing separate rows

Answers (2)

Update

Data for the updated example

Related Questions