Adam Robinsson
Adam Robinsson

Reputation: 1761

Handling of longitudinal (repeated measurements) data in dplyr?

There are indeed some nice features in the dplyr package by Hadley Wickham.

I wonder if functions in the package can solve the following issues, which relates to longitudinal data (repeated measurements on same individuals); a subject for which I find scarce tutorials on the Internet.

The data set looks like this:

test <- read.table(header=TRUE, text = "
  ID  AGE   YEAR_VISIT  BLOOD_PRESSURE  TREATMENT
  1 20  2000    130 3
  1 21  2001    129 2
  1 22  2002    145 3
  1 22  2002    130 2
  2 23  2003    NA  NA
  2 30  2010    150 2
  2 31  2011    110 3
  4 50  2005    140 3
  4 50  2005    130 3
  4 50  2005    NA  3
  4 51  2006    312 2
  5 27  2010    140 4
  5 28  2011    170 4
  5 29  2012    160 NA
  7 40  2007    120 NA
                   ")

ID is the variable that identifies a particular individual (multiple rows = repeated measurements). AGE, YEAR_VISIT (year of examination), BLOOD_PRESSURE (continuous) and TREATMENT (type of treatment, categorical) is updated each registration.

I would like to do the following:

(1) identify and retain the first AGE and first YEAR_VISIT for each individual; this should generate two new variables, e.g called first_age and first_year.

2) create a new variable which is the updated mean value of the former and current value. This means that the updated blood pressure for the first observation is simply the current value; the next observation the updated value is equal to (former blood pressure value +current)/2. If current value is missing, then it should retain the former value (if there is one).

3) fill in missing values in categorical variable TREATMENT by carrying the former value onward (last observation carried forward).

I know thats more than one question, but it relates to the same topic, which is handling of longitudinal data.

I've done efforts to solve this myself without any success and perhaps more R users have the same difficulty?

I've tried the following:

Carry forward missing blood pressure and identify first observation for each individual

test4 <- test
test4$first <- !duplicated(test4$ID)
for(i in 2:nrow(test4)){
  if(!test4$first[i] & test4$ID[i] == test4$ID[i-1] & is.na(test4$BLOOD_PRESSURE[i])){
    test4$BLOOD_PRESSURE[i] <- test4$BLOOD_PRESSURE[i-1]
  }
}
test
test4

Updated mean value for blood pressure

test5 <- test
test5$UM <- rep(NA, nrow(test5))
test5$first <- !duplicated(test5$ID)
for(i in 1:nrow(test5)){
  if(test5$first[i]){
    test5$UM[i] <- test5$BLOOD_PRESSURE[i]
  }else{
    test5$UM[i] <- mean(c(test5$BLOOD_PRESSURE[i] , test5$UM[i-1]), na.rm=TRUE)
  }
}
test5

I believe thats quiet cumbersome to code, particularly since I'd like to update means and carry forward several variables...

As you point out, I'm new to R and really appreciate your help.

Upvotes: 1

Views: 1876

Answers (1)

hrbrmstr
hrbrmstr

Reputation: 78802

Answering solely in defense of my claims in the comments:

library(dplyr)
library(zoo)

test %>%
  group_by(ID) %>%                                       # work on groups
  arrange(YEAR_VISIT) %>%                                # arrange by year
  mutate(first_age=min(AGE),                             # make col for min age
         first_year=min(YEAR_VISIT),                     # make col for min year
         bp_mean=mean(BLOOD_PRESSURE, na.rm=TRUE),       # make col for mean
         TREATMENT=na.locf(TREATMENT, na.rm=FALSE)) %>%  # use zoo's na.locf to fill in NAs
  ungroup()

##     Source: local data frame [15 x 8]
## 
##    ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT first_age first_year  bp_mean
## 1   1  20       2000            130         3        20       2000 133.5000
## 2   1  21       2001            129         2        20       2000 133.5000
## 3   1  22       2002            145         3        20       2000 133.5000
## 4   1  22       2002            130         2        20       2000 133.5000
## 5   2  23       2003             NA        NA        23       2003 130.0000
## 6   2  30       2010            150         2        23       2003 130.0000
## 7   2  31       2011            110         3        23       2003 130.0000
## 8   4  50       2005            140         3        50       2005 194.0000
## 9   4  50       2005            130         3        50       2005 194.0000
## 10  4  50       2005             NA         3        50       2005 194.0000
## 11  4  51       2006            312         2        50       2005 194.0000
...

Even if you don't know R, It'd be hard to argue that's not readable & concise.

Caveat: the NAs in TREATMENT can be filled in "reverse", but the OP did not say that was a requirement.

Upvotes: 4

Related Questions