eli-k
eli-k

Reputation: 11350

calculating the means of groups of columns in a data frame

I have a data frame with over 50 columns, to which I need to add a few columns, each of which will contain the means of a few of the original columns.

The groups of columns are of varying lengths and the column names do not contain any element identifying the group they belong to.

I'd be grateful for any suggestions on how to do this efficiently.

Please demonstrate on this simplified example: the groups of columns would be (a,b)(c,d,e)(f,g) in data frame tb defined as:

tb<-data.frame(a=runif(10)*100,b=runif(10)*100,c=runif(10)*100,  
               d=runif(10)*100,e=runif(10)*100,f=runif(10)*100,g=runif(10)*100)  
tb

    a        b         c        d        e         f         g
1  26.550866 20.59746 93.470523 48.20801 82.09463 47.761962 91.287592
2  37.212390 17.65568 21.214252 59.95658 64.70602 86.120948 29.360337
3  57.285336 68.70228 65.167377 49.35413 78.29328 43.809711 45.906573
4  90.820779 38.41037 12.555510 18.62176 55.30363 24.479728 33.239467
5  20.168193 76.98414 26.722067 82.73733 52.97196  7.067905 65.087047
6  89.838968 49.76992 38.611409 66.84667 78.93562  9.946616 25.801678
7  94.467527 71.76185  1.339033 79.42399  2.33312 31.627171 47.854525
8  66.079779 99.19061 38.238796 10.79436 47.72301 51.863426 76.631067
9  62.911404 38.00352 86.969085 72.37109 73.23137 66.200508  8.424691
10  6.178627 77.74452 34.034900 41.12744 69.27316 40.683019 87.532133

the result of the operation could look like this:

    a       b       c       d       e       f       g      mean_ab mean_cde mean_fg
1   26.551  20.597  93.471  48.208  82.095  47.762  91.288  23.574  74.591  69.525
2   37.212  17.656  21.214  59.957  64.706  86.121  29.360  27.434  48.626  57.741
3   57.285  68.702  65.167  49.354  78.293  43.810  45.907  62.994  64.272  44.858
4   90.821  38.410  12.556  18.622  55.304  24.480  33.239  64.616  28.827  28.860
5   20.168  76.984  26.722  82.737  52.972  7.068   65.087  48.576  54.144  36.077
6   89.839  49.770  38.611  66.847  78.936  9.947   25.802  69.804  61.465  17.874
7   94.468  71.762  1.339   79.424  2.333   31.627  47.855  83.115  27.699  39.741
8   66.080  99.191  38.239  10.794  47.723  51.863  76.631  82.635  32.252  64.247
9   62.911  38.004  86.969  72.371  73.231  66.201  8.425   50.457  77.524  37.313
10  6.179   77.745  34.035  41.127  69.273  40.683  87.532  41.962  48.145  64.108

thanks in advance to any suggestions.

Upvotes: 1

Views: 220

Answers (3)

Matthew Plourde
Matthew Plourde

Reputation: 44614

One more way, using split.default to group the columns.

grouping <- c('mean_ab', 'mean_ab', 'mean_cde', 'mean_cde', 'mean_cde', 'mean_fg', 'mean_fg')
data.frame(tb, lapply(split.default(tb, grouping), rowMeans))

Upvotes: 1

John
John

Reputation: 23758

It sounds like in real life you're just going to select each set of columns independently. Probably the easiest thing for you to do is pick your columns, like "a" and "b", and then...

tb$mean_ab <- rowMeans(tb[,c('a','b')])

Now, suppose you have a list of the names, or indexes (doesn't matter), of the columns. Then you could do something like this.

cols <- list(c("a", "b"), c("c", "d", "e"))
meanCols <- lapply( cols, function(x) rowMeans(tb[,x]) )
meanCols <- data.frame( meanCols )

That's all you need to generate the columns. To further use your list of column names to name the values and add them back into tb the following will work.

meanIDs <- do.call( paste0, cols )
names(meanCols) <- paste0( 'mean_', meanIDs )
tb <- cbind( tb, meanCols )

But if it's only a few columns it's probably just as easy to write the first command a few times with new column IDS.

Upvotes: 1

tophcito
tophcito

Reputation: 721

Something like that perhaps? It might be a little bit more modular and could save you some typing in the future.

groups <- list(c("a", "b"), c("c", "d", "e"), c("f", "g"))                                                                                            

tmp <- sapply(groups, function(set) {                                                                                                                 
  res <- rowMeans(tb[,set])                                                                                                                           
  return(res)                                                                                                                                         
})                                                                                                                                                    

res <- data.frame(tb, tmp)                                                                                                                            
colnames(res) <- c(colnames(tb),                                                                                                                      
                   sapply(groups,                                                                                                                     
                          function(x) paste("mean", paste(x, collapse=""), sep="_")))

Upvotes: 1

Related Questions