Reputation: 11350
I have a data frame with over 50 columns, to which I need to add a few columns, each of which will contain the means of a few of the original columns.
The groups of columns are of varying lengths and the column names do not contain any element identifying the group they belong to.
I'd be grateful for any suggestions on how to do this efficiently.
Please demonstrate on this simplified example: the groups of columns would be (a,b)(c,d,e)(f,g) in data frame tb defined as:
tb<-data.frame(a=runif(10)*100,b=runif(10)*100,c=runif(10)*100,
d=runif(10)*100,e=runif(10)*100,f=runif(10)*100,g=runif(10)*100)
tb
a b c d e f g
1 26.550866 20.59746 93.470523 48.20801 82.09463 47.761962 91.287592
2 37.212390 17.65568 21.214252 59.95658 64.70602 86.120948 29.360337
3 57.285336 68.70228 65.167377 49.35413 78.29328 43.809711 45.906573
4 90.820779 38.41037 12.555510 18.62176 55.30363 24.479728 33.239467
5 20.168193 76.98414 26.722067 82.73733 52.97196 7.067905 65.087047
6 89.838968 49.76992 38.611409 66.84667 78.93562 9.946616 25.801678
7 94.467527 71.76185 1.339033 79.42399 2.33312 31.627171 47.854525
8 66.079779 99.19061 38.238796 10.79436 47.72301 51.863426 76.631067
9 62.911404 38.00352 86.969085 72.37109 73.23137 66.200508 8.424691
10 6.178627 77.74452 34.034900 41.12744 69.27316 40.683019 87.532133
the result of the operation could look like this:
a b c d e f g mean_ab mean_cde mean_fg
1 26.551 20.597 93.471 48.208 82.095 47.762 91.288 23.574 74.591 69.525
2 37.212 17.656 21.214 59.957 64.706 86.121 29.360 27.434 48.626 57.741
3 57.285 68.702 65.167 49.354 78.293 43.810 45.907 62.994 64.272 44.858
4 90.821 38.410 12.556 18.622 55.304 24.480 33.239 64.616 28.827 28.860
5 20.168 76.984 26.722 82.737 52.972 7.068 65.087 48.576 54.144 36.077
6 89.839 49.770 38.611 66.847 78.936 9.947 25.802 69.804 61.465 17.874
7 94.468 71.762 1.339 79.424 2.333 31.627 47.855 83.115 27.699 39.741
8 66.080 99.191 38.239 10.794 47.723 51.863 76.631 82.635 32.252 64.247
9 62.911 38.004 86.969 72.371 73.231 66.201 8.425 50.457 77.524 37.313
10 6.179 77.745 34.035 41.127 69.273 40.683 87.532 41.962 48.145 64.108
thanks in advance to any suggestions.
Upvotes: 1
Views: 220
Reputation: 44614
One more way, using split.default
to group the columns.
grouping <- c('mean_ab', 'mean_ab', 'mean_cde', 'mean_cde', 'mean_cde', 'mean_fg', 'mean_fg')
data.frame(tb, lapply(split.default(tb, grouping), rowMeans))
Upvotes: 1
Reputation: 23758
It sounds like in real life you're just going to select each set of columns independently. Probably the easiest thing for you to do is pick your columns, like "a" and "b", and then...
tb$mean_ab <- rowMeans(tb[,c('a','b')])
Now, suppose you have a list of the names, or indexes (doesn't matter), of the columns. Then you could do something like this.
cols <- list(c("a", "b"), c("c", "d", "e"))
meanCols <- lapply( cols, function(x) rowMeans(tb[,x]) )
meanCols <- data.frame( meanCols )
That's all you need to generate the columns. To further use your list of column names to name the values and add them back into tb
the following will work.
meanIDs <- do.call( paste0, cols )
names(meanCols) <- paste0( 'mean_', meanIDs )
tb <- cbind( tb, meanCols )
But if it's only a few columns it's probably just as easy to write the first command a few times with new column IDS.
Upvotes: 1
Reputation: 721
Something like that perhaps? It might be a little bit more modular and could save you some typing in the future.
groups <- list(c("a", "b"), c("c", "d", "e"), c("f", "g"))
tmp <- sapply(groups, function(set) {
res <- rowMeans(tb[,set])
return(res)
})
res <- data.frame(tb, tmp)
colnames(res) <- c(colnames(tb),
sapply(groups,
function(x) paste("mean", paste(x, collapse=""), sep="_")))
Upvotes: 1