Reputation: 3
I'm trying to write a loop in R that has the same effect on a dataset as the following code:
AD1 <- subset(data, AD1==1)
AD2 <- subset(data, AD2==1)
AD3 <- subset(data, AD3==1)
AD4 <- subset(data, AD4==1)
AD5 <- subset(data, AD5==1)
After a few attempts, there is where I'm stuck:
for (i in 1:5) {
adNum <- paste("AD", i, sep="")
aSubset <- paste(adNum, " <- subset(data, ", adNum, "==1)", sep="")
aSubset
}
Ideally, I'd like to be able to apply the same principle to loop this code for all 5 subsets:
uAD1 <- mean(AD1$u)
vAD1 <- mean(AD1$v)
wAD1 <- mean(AD1$w)
xAD1 <- mean(AD1$x)
yAD1 <- mean(AD1$y)
zAD1 <- mean(AD1$z)
Upvotes: 0
Views: 1823
Reputation: 11957
It sounds like you're in "split-apply-combine" territory: you want to cut up a large data set into subsets, and apply the same operations to those subsets (or different columns of those subsets). While using looped code and parsed evaluation isn't wrong per se, this method tends to be fragile (breaks if something subtle changes), error-prone, and hard to read.
R has functions such as aggregate
and lapply/vapply
that are useful for these purposes. There are now also several mature R packages that help a person do exactly these types of operations (check out the data.table and tidyverse packages).
Consider the "diamonds" data set (which comes with the tidyverse packages). The first few rows look like this:
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Suppose we wanted the average depth, table, and price for each type of cut. In base R, you could use aggregate
:
data.agg <- aggregate(diamonds[c('depth', 'table', 'price')], by = diamonds['cut'], FUN = mean)
cut depth table price
1 Fair 64.04168 59.05379 4358.758
2 Good 62.36588 58.69464 3928.864
3 Very Good 61.81828 57.95615 3981.760
4 Premium 61.26467 58.74610 4584.258
5 Ideal 61.70940 55.95167 3457.542
Or using tidyverse's dplyr package for data manipulation:
library(dplyr)
data.dplyr <- diamonds %>%
group_by(cut) %>%
select(depth, table, price) %>%
summarize_all(mean)
cut depth table price
<ord> <dbl> <dbl> <dbl>
1 Fair 64.04168 59.05379 4358.758
2 Good 62.36588 58.69464 3928.864
3 Very Good 61.81828 57.95615 3981.760
4 Premium 61.26467 58.74610 4584.258
5 Ideal 61.70940 55.95167 3457.542
While dplyr is more verbose than the basic aggregate
function (more typing), it's more flexible and allows for more advanced data manipulation and summarization operations. Note that both versions are much more succinct than the looping alternative, and easy to read and follow.
Upvotes: 2
Reputation: 1146
I'd suggest trying something in the lines of the following code, with modifications if needed, depending on what output structure you want.
#' subsets to list
aSubset <- lapply(1:5, FUN = function(X){
eval(parse(text = sprintf("subset(data, AD%d == 1)", X)))
})
#' assuming your original data came from a data.frame,
colNames <- c('u','v','w','x','z')
#' for each list element, calculate column means. I suggest modifying
#' the output to the structure you want, i.e. by using llply or etc
require(plyr)
adMeans <- ldply(aSubset, function(x){
colMeans(x[, colNames])
})
#' for a clearer output, these might be useful:
colnames(adMeans) <- colNames
rownames(adMeans) <- sprintf('AD%d', 1:5)
Upvotes: 0