marcel

Reputation: 409

Difficulty manipulating longitudinal data by row in R

I am having a bit of trouble working with longitudinal data: my dataset consists of one unique ID per row, followed by a series of visit dates. At each visit there are values for 3 dichotomous variables.

data1 <- structure(list(V1date = structure(c(2L, 1L, 2L, 3L, 4L), .Label = c("1/22/12", "4/5/12", "8/18/12", "9/6/12"), class = "factor"), 
V1a = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"), 
V1b = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
V1c = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
V2date = structure(c(1L, 2L, 4L, 3L, NA), .Label = c("6/18/12", "7/5/12", "9/22/12", "9/4/12"), class = "factor"), 
V2a = structure(c(1L, 1L, 1L, 1L, NA), .Label = "Yes", class = "factor"), 
V2b = structure(c(1L, 1L, 1L, 1L, NA), .Label = "No", class = "factor"), 
V2c = structure(c(1L, 1L, 1L, 1L, NA), .Label = "Yes", class = "factor"), 
V3date = structure(c(NA, NA, 1L, NA, 2L), .Label = c("11/1/12", "12/4/12"), class = "factor"), 
V3a = structure(c(NA, NA, 1L, NA, 1L), .Label = "Yes", class = "factor"), 
V3b = structure(c(NA, NA, 1L, NA, 1L), .Label = "No", class = "factor"), 
V3c = structure(c(NA, NA, 2L, NA, 1L), .Label = c("No", "Yes"), class = "factor")),
 .Names = c("V1date", "V1a", "V1b", "V1c", "V2date", "V2a", "V2b", "V2c", "V3date", "V3a", "V3b", "V3c"), 
class = "data.frame", row.names = c("001",  "002", "003", "004", "005"))

data1    
     V1date V1a V1b V1c  V2date  V2a  V2b  V2c  V3date  V3a  V3b  V3c
001  4/5/12  No Yes  No 6/18/12  Yes   No  Yes    <NA> <NA> <NA> <NA>
002 1/22/12  No  No Yes  7/5/12  Yes   No  Yes    <NA> <NA> <NA> <NA>
003  4/5/12 Yes  No  No  9/4/12  Yes   No  Yes 11/1/12  Yes   No  Yes
004 8/18/12  No  No  No 9/22/12  Yes   No  Yes    <NA> <NA> <NA> <NA>
005  9/6/12 Yes  No  No    <NA> <NA> <NA> <NA> 12/4/12  Yes   No   No

Of the 8 possible combinations of the three variables, 4 are "abnormal" and the remaining 4 are "normal". Everyone starts out abnormal, and then either stays abnormal at subsequent visits or resolves to a normal pattern at a later visit. I ignore reversion back to abnormal: once they are normal, they are normal.

I have to end up with 4 new columns added to the right-hand side of the data frame indicating 1) the date of the last completed visit (regardless of intervening NAs), 2) whether an ID eventually resolved or stayed abnormal, 3) if resolved, what the resolution pattern was, and 4) what the date of resolution was. NAs always come in groups of 4 (i.e. no visit date and no value for the 3 variables) and are ignored.

So for example, if the patterns "yes-yes-no", "yes-no-yes", "no-yes-yes" and "yes-yes-yes" are normal and the remaining patterns are all abnormal, the result would be four additional columns as follows:

data2 <- structure(list(
LastVisDate = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("6/18/12", "12/4/12", "11/1/12", "9/22/12"), class = "factor"), 
Resolved = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("No", "Yes"), class = "factor"), 
Pattern = structure(c(1L, 1L, 1L, 1L, NA), .Label = "yny", class = "factor"), 
Resdate = structure(c(1L, 2L, 3L, 4L, NA), .Label = c("6/18/12", "7/5/12", "9/4/12", "9/22/12"), class = "factor")),
.Names = c("LastVisDate", "Resolved", "Pattern", "Resdate"),   
class = "data.frame", row.names = c("001",  "002", "003", "004", "005"))

data2
    LastVisDate Resolved Pattern Resdate
001     11/1/12      Yes     yny 6/18/12
002     12/4/12      Yes     yny  7/5/12
003     11/1/12      Yes     yny  9/4/12
004     11/1/12      Yes     yny 9/22/12
005     12/4/12       No    <NA>    <NA>

I spent a lot of time on this project, but couldn't figure out how to ask R to march rightward through the dataset until my stopping rules are satisfied. Suggestions greatly appreciated.

Upvotes: 3

Views: 335

Answers (1)

Matthew Lundberg

Reputation: 42659

This relies on the structure of your data. In particular, it assumes that the three values for each visit start at columns 2, 6 and 10; those columns are passed to the function that determines whether someone is "normal".

Here's a function to determine if someone is "normal". There are other ways to write this.

is.normal <- function(x) {
  any(c(
    all(x == c("Yes", "Yes", "No")),
    all(x == c("Yes", "No", "Yes")),
    all(x == c("No", "Yes", "Yes")),
    all(x == c("Yes", "Yes", "Yes"))
  ))
}
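As one of those other ways, here is a minimal sketch that collapses the three values into a single string and looks it up in a vector of normal patterns. The names is.normal2 and normal.patterns are made up for illustration and are not part of the answer above. It returns NA whenever any of the three values is missing, which matches the "NAs come in groups" rule from the question.

```r
# Hypothetical alternative to is.normal(); names here are illustrative only
normal.patterns <- c("Yes-Yes-No", "Yes-No-Yes", "No-Yes-Yes", "Yes-Yes-Yes")

is.normal2 <- function(x) {
  # A missing visit (any NA among the three values) gives NA rather than FALSE
  if (anyNA(x)) return(NA)
  # paste() converts factors to their labels, so this works on either type
  paste(x, collapse = "-") %in% normal.patterns
}

is.normal2(c("Yes", "No", "Yes"))  # TRUE
is.normal2(c("No", "No", "Yes"))   # FALSE
is.normal2(c(NA, NA, NA))          # NA
```

Changing which patterns count as normal then only means editing normal.patterns.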

We apply this function to each visit's set of three columns. This depends on the exact layout that you have specified in the question; note the column numbers passed to vapply. The result is a logical matrix indicating whether each person is "normal" at each visit, with NA where the visit is missing.

# Columns 2, 6 and 10 hold the first of the three values for visits 1, 2 and 3
ok <- vapply(c(2, 6, 10),
             function(x) apply(data1[x:(x + 2)], 1, is.normal),
             logical(nrow(data1)))

> ok
     [,1] [,2]  [,3]
001 FALSE TRUE    NA
002 FALSE TRUE    NA
003 FALSE TRUE  TRUE
004 FALSE TRUE    NA
005 FALSE   NA FALSE
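If hard-coding 2, 6 and 10 feels fragile, the starting columns could be derived from the column names instead. This is only a sketch, and it assumes that every visit's date column ends in "date" and is immediately followed by its three values, as in the question's layout. The names col.names, date.cols and value.starts are hypothetical.

```r
# Column names as given in the question's data1
col.names <- c("V1date", "V1a", "V1b", "V1c",
               "V2date", "V2a", "V2b", "V2c",
               "V3date", "V3a", "V3b", "V3c")

# Positions of the date columns, assuming they all end in "date"
date.cols <- grep("date$", col.names)

# The three values for each visit start one column after its date
value.starts <- date.cols + 1

date.cols     # 1 5 9
value.starts  # 2 6 10
```

With the real data frame, names(data1) would take the place of col.names, and value.starts could be passed to vapply in place of c(2, 6, 10).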

Now find the first visit at which each person becomes "normal", if any. By inspection, that is visit 2 for everyone but the last, who remains abnormal. The if guards against min returning Inf (with a warning) when normalcy is never achieved.

date.ind <- apply(ok, 1,
              function(x) {
                y <- which(x)
                if (length(y)) min(y) else NA
              }
)

> date.ind
001 002 003 004 005 
  2   2   2   2  NA 

Then we can extract the date at which normalcy is achieved. Each visit occupies four columns, so the date column for visit k is column 4*k - 3 (that is, columns 1, 5 and 9):

dates <- vapply(seq_along(date.ind),
                function(x) {
                  if (is.na(date.ind[x])) NA_character_
                  else as.character(data1[x, date.ind[x] * 4 - 3])
                },
                character(1))
> dates
[1] "6/18/12" "7/5/12"  "9/4/12"  "9/22/12" NA   

Extracting other information is similar, as the column indices can be computed as above.
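To make that concrete, here is a rough sketch of how the four requested columns might be built. It is not the only way to do it. To keep it self-contained it rebuilds the question's data with character columns rather than factors, and it hard-codes date.ind with the values found above. The names date.cols, last.ind, rows and result are made up for illustration.

```r
# Compact copy of the question's data1 (character columns instead of factors)
data1 <- data.frame(
  V1date = c("4/5/12", "1/22/12", "4/5/12", "8/18/12", "9/6/12"),
  V1a = c("No", "No", "Yes", "No", "Yes"),
  V1b = c("Yes", "No", "No", "No", "No"),
  V1c = c("No", "Yes", "No", "No", "No"),
  V2date = c("6/18/12", "7/5/12", "9/4/12", "9/22/12", NA),
  V2a = c("Yes", "Yes", "Yes", "Yes", NA),
  V2b = c("No", "No", "No", "No", NA),
  V2c = c("Yes", "Yes", "Yes", "Yes", NA),
  V3date = c(NA, NA, "11/1/12", NA, "12/4/12"),
  V3a = c(NA, NA, "Yes", NA, "Yes"),
  V3b = c(NA, NA, "No", NA, "No"),
  V3c = c(NA, NA, "Yes", NA, "No"),
  row.names = c("001", "002", "003", "004", "005"),
  stringsAsFactors = FALSE
)

# First visit at which each ID is "normal", as computed earlier in this answer
date.ind <- c(`001` = 2L, `002` = 2L, `003` = 2L, `004` = 2L, `005` = NA)

date.cols <- c(1, 5, 9)            # V1date, V2date, V3date
rows <- seq_len(nrow(data1))
date.mat <- as.matrix(data1[date.cols])

# Last completed visit: right-most non-missing date in each row
last.ind <- apply(!is.na(date.mat), 1, function(x) max(which(x)))
LastVisDate <- date.mat[cbind(rows, last.ind)]

# Resolved if a first "normal" visit was found
Resolved <- ifelse(is.na(date.ind), "No", "Yes")

# Pattern at the resolving visit, abbreviated to first letters (e.g. "yny")
Pattern <- vapply(rows, function(i) {
  if (is.na(date.ind[i])) return(NA_character_)
  first <- date.ind[i] * 4 - 2     # column of the "a" value for that visit
  vals <- vapply(data1[i, first:(first + 2)], as.character, character(1))
  paste(tolower(substr(vals, 1, 1)), collapse = "")
}, character(1))

# Date of resolution; an NA index in the index matrix yields NA
Resdate <- date.mat[cbind(rows, date.ind)]

result <- cbind(data1,
                data.frame(LastVisDate, Resolved, Pattern, Resdate,
                           stringsAsFactors = FALSE))
```

For this data, LastVisDate comes out as 6/18/12, 7/5/12, 11/1/12, 9/22/12 and 12/4/12. That follows the stated rule (the right-most non-missing date) and differs from the LastVisDate column shown in the question's data2.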

Upvotes: 1
