Reputation: 101
I've just transitioned to using R from SAS and I'm working with a very large data set (half a million observations and 20 thousand variables) that needs quite a bit of recoding. I imagine this is a pretty basic question, but I'm still learning so I'd really appreciate any guidance!
Many of the variables have three instances and each instance has multiple arrays. For this problem, I am using the "History of Father's Illness." There are many illnesses included, but I am primarily interested in CAD (coded as "1").
An example of how the data looks:
n_20107_0_0 n_20107_0_1 n_20107_0_2
NA NA NA
7 1 8
4 6 1
I've only included 3 arrays here, but in reality there are close to 20. I did a bit of research and determined that the most efficient way to do this would be to create a list with the variables and then use lapply. This is what I have attempted:
FatherDisease1 <- paste("n_20107_0_", 0:3, sep = "")
lapply(FatherDisease1, transform, FatherCAD_0_0 = ifelse(FatherDisease1 == 1, 1, 0))
I don't quite get the results I am looking for when I do this.
n_20107_0_0 n_20107_0_1 n_20107_0_2 FatherCAD_0_0
NA NA NA 0
7 1 8 0
4 6 1 0
What I would like is to check all of the arrays within an instance: if the person answered 1 in any of them, "FatherCAD_0_0" should equal 1; otherwise it should equal 0. Instead, I only ever end up with 0's. I would also like the NAs to stay as NAs. This is what I would like it to look like:
n_20107_0_0 n_20107_0_1 n_20107_0_2 FatherCAD_0_0
NA NA NA NA
7 1 8 1
4 6 1 1
I've figured out how to do this the "long" way (30+ lines of code -_-) but am trying to get better at writing more elegant and efficient code. Any help would be greatly appreciated!!
Upvotes: 0
Views: 3316
Reputation: 14370
Assuming your data is in a data.frame, you could use apply to loop over each row and check whether any of the columns you are interested in contains a 1:
FatherDisease1 <- paste("n_20107_0_", 0:2, sep = "")
# Subset to the columns of interest first, so apply() converts only those columns to a matrix
df$FatherCAD_0_0 <- apply(df[FatherDisease1], 1, function(x) as.integer(any(x == 1)))
df
# n_20107_0_0 n_20107_0_1 n_20107_0_2 FatherCAD_0_0
#1 NA NA NA NA
#2 7 1 8 1
#3 4 6 1 1
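One thing to watch with the approach above: any() returns NA not only when every value in a row is NA, but also when no value equals 1 and at least one value is missing. With half a million rows, a row-wise apply() can also be slow. The following is a minimal sketch of a vectorized alternative using rowSums, not a definitive implementation. It assumes that a row should be NA only when all of its columns are NA, which may not match what you want. Row 4 is a hypothetical row added to show the difference, and the names cols, m, via_apply, and via_rowsums are made up for this illustration.

```r
# The question's three rows plus one hypothetical row (row 4) with partial missingness
df <- data.frame(
  n_20107_0_0 = c(NA, 7L, 4L, NA),
  n_20107_0_1 = c(NA, 1L, 6L, 7L),
  n_20107_0_2 = c(NA, 8L, 1L, 8L)
)
cols <- paste0("n_20107_0_", 0:2)
m <- as.matrix(df[cols])  # integer matrix holding only the columns of interest

# Row-wise version: any() gives NA when no value is 1 and at least one value is NA
via_apply <- apply(m, 1, function(x) as.integer(any(x == 1)))

# Vectorized version: count the 1s per row, ignoring NAs
has_cad     <- rowSums(m == 1, na.rm = TRUE) > 0
all_missing <- rowSums(!is.na(m)) == 0

via_rowsums <- as.integer(has_cad)
via_rowsums[all_missing] <- NA  # assumed rule: NA only when every column is NA

df$FatherCAD_0_0 <- via_rowsums
print(df)
```

For row 4, the apply() version yields NA while the rowSums version yields 0. Which one is right depends on whether a partially missing history should count as "no CAD reported" or as unknown.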
Data:
df <- structure(list(n_20107_0_0 = c(NA, 7L, 4L), n_20107_0_1 = c(NA,
1L, 6L), n_20107_0_2 = c(NA, 8L, 1L)), .Names = c("n_20107_0_0",
"n_20107_0_1", "n_20107_0_2"), row.names = c(NA, -3L), class = "data.frame")
Upvotes: 1