Reputation: 1761
There are indeed some nice features in the dplyr package by Hadley Wickham.
I wonder if functions in the package can solve the following issues, which relates to longitudinal data (repeated measurements on same individuals); a subject for which I find scarce tutorials on the Internet.
The data set looks like this:
test <- read.table(header=TRUE, text = "
ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT
1 20 2000 130 3
1 21 2001 129 2
1 22 2002 145 3
1 22 2002 130 2
2 23 2003 NA NA
2 30 2010 150 2
2 31 2011 110 3
4 50 2005 140 3
4 50 2005 130 3
4 50 2005 NA 3
4 51 2006 312 2
5 27 2010 140 4
5 28 2011 170 4
5 29 2012 160 NA
7 40 2007 120 NA
")
ID is the variable that identifies a particular individual (multiple rows = repeated measurements). AGE, YEAR_VISIT (year of examination), BLOOD_PRESSURE (continuous) and TREATMENT (type of treatment, categorical) is updated each registration.
I would like to do the following:
(1) identify and retain the first AGE and first YEAR_VISIT for each individual; this should generate two new variables, e.g called first_age and first_year.
2) create a new variable which is the updated mean value of the former and current value. This means that the updated blood pressure for the first observation is simply the current value; the next observation the updated value is equal to (former blood pressure value +current)/2. If current value is missing, then it should retain the former value (if there is one).
3) fill in missing values in categorical variable TREATMENT by carrying the former value onward (last observation carried forward).
I know thats more than one question, but it relates to the same topic, which is handling of longitudinal data.
I've done efforts to solve this myself without any success and perhaps more R users have the same difficulty?
I've tried the following:
Carry forward missing blood pressure and identify first observation for each individual
test4 <- test
test4$first <- !duplicated(test4$ID)
for(i in 2:nrow(test4)){
if(!test4$first[i] & test4$ID[i] == test4$ID[i-1] & is.na(test4$BLOOD_PRESSURE[i])){
test4$BLOOD_PRESSURE[i] <- test4$BLOOD_PRESSURE[i-1]
}
}
test
test4
Updated mean value for blood pressure
test5 <- test
test5$UM <- rep(NA, nrow(test5))
test5$first <- !duplicated(test5$ID)
for(i in 1:nrow(test5)){
if(test5$first[i]){
test5$UM[i] <- test5$BLOOD_PRESSURE[i]
}else{
test5$UM[i] <- mean(c(test5$BLOOD_PRESSURE[i] , test5$UM[i-1]), na.rm=TRUE)
}
}
test5
I believe thats quiet cumbersome to code, particularly since I'd like to update means and carry forward several variables...
As you point out, I'm new to R and really appreciate your help.
Upvotes: 1
Views: 1876
Reputation: 78802
Answering solely in defense of my claims in the comments:
library(dplyr)
library(zoo)
test %>%
group_by(ID) %>% # work on groups
arrange(YEAR_VISIT) %>% # arrange by year
mutate(first_age=min(AGE), # make col for min age
first_year=min(YEAR_VISIT), # make col for min year
bp_mean=mean(BLOOD_PRESSURE, na.rm=TRUE), # make col for mean
TREATMENT=na.locf(TREATMENT, na.rm=FALSE)) %>% # use zoo's na.locf to fill in NAs
ungroup()
## Source: local data frame [15 x 8]
##
## ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT first_age first_year bp_mean
## 1 1 20 2000 130 3 20 2000 133.5000
## 2 1 21 2001 129 2 20 2000 133.5000
## 3 1 22 2002 145 3 20 2000 133.5000
## 4 1 22 2002 130 2 20 2000 133.5000
## 5 2 23 2003 NA NA 23 2003 130.0000
## 6 2 30 2010 150 2 23 2003 130.0000
## 7 2 31 2011 110 3 23 2003 130.0000
## 8 4 50 2005 140 3 50 2005 194.0000
## 9 4 50 2005 130 3 50 2005 194.0000
## 10 4 50 2005 NA 3 50 2005 194.0000
## 11 4 51 2006 312 2 50 2005 194.0000
...
Even if you don't know R, It'd be hard to argue that's not readable & concise.
Caveat: the NA
s in TREATMENT
can be filled in "reverse", but the OP did not say that was a requirement.
Upvotes: 4