Reputation: 409
I am having a bit of trouble working with longitudinal data: my dataset consists of one unique ID per row, followed by a series of visit dates. At each visit there are values for 3 dichotomous variables.
data1 <- structure(list(V1date = structure(c(2L, 1L, 2L, 3L, 4L), .Label = c("1/22/12", "4/5/12", "8/18/12", "9/6/12"), class = "factor"),
V1a = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
V1b = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
V1c = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
V2date = structure(c(1L, 2L, 4L, 3L, NA), .Label = c("6/18/12", "7/5/12", "9/22/12", "9/4/12"), class = "factor"),
V2a = structure(c(1L, 1L, 1L, 1L, NA), .Label = "Yes", class = "factor"),
V2b = structure(c(1L, 1L, 1L, 1L, NA), .Label = "No", class = "factor"),
V2c = structure(c(1L, 1L, 1L, 1L, NA), .Label = "Yes", class = "factor"),
V3date = structure(c(NA, NA, 1L, NA, 2L), .Label = c("11/1/12", "12/4/12"), class = "factor"),
V3a = structure(c(NA, NA, 1L, NA, 1L), .Label = "Yes", class = "factor"),
V3b = structure(c(NA, NA, 1L, NA, 1L), .Label = "No", class = "factor"),
V3c = structure(c(NA, NA, 2L, NA, 1L), .Label = c("No", "Yes"), class = "factor")),
.Names = c("V1date", "V1a", "V1b", "V1c", "V2date", "V2a", "V2b", "V2c", "V3date", "V3a", "V3b", "V3c"),
class = "data.frame", row.names = c("001", "002", "003", "004", "005"))
data1
V1date V1a V1b V1c V2date V2a V2b V2c V3date V3a V3b V3c
001 4/5/12 No Yes No 6/18/12 Yes No Yes <NA> <NA> <NA> <NA>
002 1/22/12 No No Yes 7/5/12 Yes No Yes <NA> <NA> <NA> <NA>
003 4/5/12 Yes No No 9/4/12 Yes No Yes 11/1/12 Yes No Yes
004 8/18/12 No No No 9/22/12 Yes No Yes <NA> <NA> <NA> <NA>
005 9/6/12 Yes No No <NA> <NA> <NA> <NA> 12/4/12 Yes No No
Of the 8 possible combinations of the three variables, 4 are "abnormal" and the remaining 4 are "normal". Everyone starts out abnormal, and then either continues to be abnormal at subsequent visits, or resolves to a normal pattern at a later visit (I ignore reversion back to abnormal: once they are normal, they are normal).
I have to end up with 4 new columns added to the right-hand side of the dataframe indicating 1) date of last completed visit (regardless of intervening "NAs"), 2) whether an ID eventually resolved or stayed abnormal, 3) if resolved, what the resolution pattern was and 4) what the date of resolution was. NAs always come in groups of 4 (i.e. no visit date, and no value for the 3 variables) and are ignored.
So for example, if the patterns "yes-yes-no", "yes-no-yes", "no-yes-yes" and "yes-yes-yes" are normal and the remaining patterns are all abnormal, the result would be four additional columns as follows:
data2 <- structure(list(
LastVisDate = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("6/18/12", "12/4/12", "11/1/12", "9/22/12"), class = "factor"),
Resolved = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("No", "Yes"), class = "factor"),
Pattern = structure(c(1L, 1L, 1L, 1L, NA), .Label = "yny", class = "factor"),
Resdate = structure(c(1L, 2L, 3L, 4L, NA), .Label = c("6/18/12", "7/5/12", "9/4/12", "9/22/12"), class = "factor")),
.Names = c("LastVisDate", "Resolved", "Pattern", "Resdate"),
class = "data.frame", row.names = c("001", "002", "003", "004", "005"))
data2
LastVisDate Resolved Pattern Resdate
001 11/1/12 Yes yny 6/18/12
002 12/4/12 Yes yny 7/5/12
003 11/1/12 Yes yny 9/4/12
004 11/1/12 Yes yny 9/22/12
005 12/4/12 No <NA> <NA>
I spent a lot of time on this project, but couldn't figure out how to ask R to march rightward through the dataset until my stopping rules are satisfied. Suggestions greatly appreciated.
Upvotes: 3
Views: 335
Reputation: 42659
This relies on the structure of your data: in particular, each visit's three values start at columns 2, 6 and 10, and those values are passed to a function that determines whether someone is "normal".
Here's a function to determine if someone is "normal". There are other ways to write this.
is.normal <- function(x) {
  # TRUE if the three values match any one of the four "normal" patterns
  any(c(
    all(x == c("Yes", "Yes", "No")),
    all(x == c("Yes", "No", "Yes")),
    all(x == c("No", "Yes", "Yes")),
    all(x == c("Yes", "Yes", "Yes"))
  ))
}
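As one of those other ways, here is a minimal sketch that pastes the three values into a single string and looks it up in a vector of normal patterns. The names is.normal2 and normal.patterns are made up for this illustration and are not part of the code above; returning NA for a missed visit is an assumption chosen to mirror how is.normal behaves on NA input.

```r
# Hypothetical lookup vector: the four "normal" patterns from the question
normal.patterns <- c("Yes-Yes-No", "Yes-No-Yes", "No-Yes-Yes", "Yes-Yes-Yes")

# Hypothetical alternative to is.normal(); x is a vector of three values
is.normal2 <- function(x) {
  # A missed visit (any NA) gives NA rather than FALSE
  if (anyNA(x)) return(NA)
  paste(x, collapse = "-") %in% normal.patterns
}

is.normal2(c("Yes", "No", "Yes"))
is.normal2(c("No", "No", "Yes"))
is.normal2(c(NA, NA, NA))
```

Changing which patterns count as normal then only means editing the lookup vector.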
We use this, applying to the appropriate sets of columns. This depends on the exact layout that you have specified in the question. Note the column numbers passed to vapply. The result here is a logical matrix, telling if someone is "normal" at each step.
ok <- vapply(c(2, 6, 10),
             function(x) apply(data1[x:(x + 2)], 1, is.normal),
             logical(nrow(data1)))
> ok
[,1] [,2] [,3]
001 FALSE TRUE NA
002 FALSE TRUE NA
003 FALSE TRUE TRUE
004 FALSE TRUE NA
005 FALSE NA FALSE
Now find the first time that each person becomes "normal", if any. By inspection, that's 2 for everyone but the last, who remains abnormal. The if is used to prevent Inf return values from min when normalcy is not achieved.
date.ind <- apply(ok, 1,
                  function(x) {
                    y <- which(x)
                    if (length(y)) min(y) else NA
                  }
)
> date.ind
001 002 003 004 005
2 2 2 2 NA
Then we can extract the date, knowing the "group" from above, and how to get to the actual date column where normalcy is achieved:
dates <- vapply(seq_along(date.ind),
                function(x) {
                  # the date for visit k sits in column 4 * k - 3
                  if (is.na(date.ind[x])) NA_character_
                  else as.character(data1[x, date.ind[x] * 4 - 3])
                },
                character(1))
> dates
[1] "6/18/12" "7/5/12" "9/4/12" "9/22/12" NA
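These dates are still character strings, so they cannot be sorted or subtracted as dates. If that is needed, a sketch along these lines should work, assuming every date follows the month/day/two-digit-year layout shown in the question (res.dates is a name made up for this example):

```r
# Resolution dates as extracted above (copied here so the block runs alone)
dates <- c("6/18/12", "7/5/12", "9/4/12", "9/22/12", NA)

# Assumed format: month/day/two-digit year; NA stays NA
res.dates <- as.Date(dates, format = "%m/%d/%y")
res.dates

# Date arithmetic now works, e.g. the span between earliest and latest
diff(range(res.dates, na.rm = TRUE))
```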
Extracting other information is similar, as the column indices can be computed as above.
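To make that concrete, here is a rough sketch that builds all four requested columns in one pass. It is not the code above but a variant using matrix indexing, and it rests on a few assumptions: data1 is rebuilt here with character columns instead of factors so the block runs on its own, every ID has a first visit, and the names date.cols, visit.pattern, res.ind, last.ind and result are invented for the illustration.

```r
# Stand-in for data1 from the question, with character columns
data1 <- data.frame(
  V1date = c("4/5/12", "1/22/12", "4/5/12", "8/18/12", "9/6/12"),
  V1a = c("No", "No", "Yes", "No", "Yes"),
  V1b = c("Yes", "No", "No", "No", "No"),
  V1c = c("No", "Yes", "No", "No", "No"),
  V2date = c("6/18/12", "7/5/12", "9/4/12", "9/22/12", NA),
  V2a = c("Yes", "Yes", "Yes", "Yes", NA),
  V2b = c("No", "No", "No", "No", NA),
  V2c = c("Yes", "Yes", "Yes", "Yes", NA),
  V3date = c(NA, NA, "11/1/12", NA, "12/4/12"),
  V3a = c(NA, NA, "Yes", NA, "Yes"),
  V3b = c(NA, NA, "No", NA, "No"),
  V3c = c(NA, NA, "Yes", NA, "No"),
  row.names = c("001", "002", "003", "004", "005"),
  stringsAsFactors = FALSE
)

date.cols <- c(1, 5, 9)                       # date column of each visit
normal.patterns <- c("yyn", "yny", "nyy", "yyy")

# Short pattern code (e.g. "yny") for the visit whose date is in column j
visit.pattern <- function(j) {
  v <- data1[(j + 1):(j + 3)]
  yn <- lapply(v, function(col) ifelse(col == "Yes", "y", "n"))
  code <- paste0(yn[[1]], yn[[2]], yn[[3]])
  code[is.na(v[[1]])] <- NA                   # paste0 turns NA into "NA"
  code
}

dates <- as.matrix(data1[date.cols])          # IDs in rows, visits in columns
pat <- sapply(date.cols, visit.pattern)       # same shape, pattern codes

# Missed visits count as "not normal" here, which is enough to find the
# first normal visit
is.norm <- matrix(pat %in% normal.patterns, nrow = nrow(pat))

# First normal visit per ID (NA if never), and last visit with a date
res.ind <- apply(is.norm, 1, function(x) {
  y <- which(x)
  if (length(y)) y[1] else NA_integer_
})
last.ind <- apply(!is.na(dates), 1, function(x) max(which(x)))

rows <- seq_len(nrow(data1))
result <- data.frame(
  LastVisDate = dates[cbind(rows, last.ind)],
  Resolved = ifelse(is.na(res.ind), "No", "Yes"),
  Pattern = pat[cbind(rows, res.ind)],        # NA index gives NA result
  Resdate = dates[cbind(rows, res.ind)],
  row.names = rownames(data1),
  stringsAsFactors = FALSE
)
result
```

LastVisDate here is the last non-missing date in each ID's own row of data1, so some values differ from the example output in the question (for ID 001 it is 6/18/12). cbind(data1, result) would attach the new columns to the right-hand side of the original data frame.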
Upvotes: 1