user2500444
user2500444

Reputation: 111

How to calculate the mean of those columns in a data frame with the same column name

I have a data frame consisting of 10299 observations of 66 variables. Some of these variables share a common column name, and I would like to calculate the mean of those variables for each observation.

Having the following matrix, with column names c(A, B, C, B, A ,C):

A B C B A C                             
1 2 3 4 5 6         
3 5 6 7 4 3                             
3 3 3 3 5 5                             
2 2 2 2 2 2

I would like to get:

A   B   C    
3   3   4.5
3.5 6   4.5
4   3   4
2   2   2

I tried for loops, the command aggregate() but I don't get the desired result.

Sorry if the question seems too basic, I have checked google for possible solutions but I didn't find any.

Upvotes: 4

Views: 6370

Answers (4)

d.b
d.b

Reputation: 32558

You can use split.default to divide the dataframe into different groups based in column names and then use rowMeans to get the average of columns with same name.

sapply(split.default(df, names(df)), rowMeans)
#       A B   C
#[1,] 3.0 3 4.5
#[2,] 3.5 6 4.5
#[3,] 4.0 3 4.0
#[4,] 2.0 2 2.0

DATA

df = structure(list(A = c(1L, 3L, 3L, 2L), B = c(2L, 5L, 3L, 2L), 
    C = c(3L, 6L, 3L, 2L), B = c(4L, 7L, 3L, 2L), A = c(5L, 4L, 
    5L, 2L), C = c(6L, 3L, 5L, 2L)), .Names = c("A", "B", "C", 
"B", "A", "C"), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 1

gagolews
gagolews

Reputation: 13076

Here's a solution.

First let's define an exemplary data.frame (the same as in your example).

df <- as.data.frame(
    matrix(c(1,3,3,2,2,5,3,2,3,6,3,2,4,7,3,2,5,4,5,2,6,3,5,2),
        ncol=6,
        dimnames=list(NULL, c("A", "B", "C", "B", "A", "C"))
    )
  )

Below we apply a custom function on each unique column name, col: it selects all the columns named col and calculates rowMeans. The result, list of atomic vectors, will be coerced to a data.frame:

res <- as.data.frame( # sapply returns a list here, so we convert it to a data.frame
    sapply(unique(names(df)), # for each unique column name
       function(col) rowMeans(df[names(df) == col]) # calculate row means
    )
  )

The result:

res
##     A B   C
## 1 3.0 3 4.5
## 2 3.5 6 4.5
## 3 4.0 3 4.0
## 4 2.0 2 2.0

EDIT: As there are many solutions proposed already, let's benchmark them:

set.seed(123)
df <- as.data.frame(matrix(sample(1:9, replace=TRUE, 10000*100),
   dimnames=list(NULL, sample(LETTERS[1:5], 100, replace=TRUE)), ncol=100))
library(microbenchmark)
microbenchmark(...)
## Unit: milliseconds
##                   min         lq     median         uq        max neval
## @gagolews   61.196075   65.73211   77.22533  119.42028  127.32557    10
## @joran       8.297964   10.05242   10.90564   15.25943   65.69156    10
## @Davide   5535.272680 5731.24220 5754.67006 5808.47807 5862.22628    10

The clear winner (at least as far as speed is concerned) is @joran's lapply+split+Reduce. Congrats! :-)

Upvotes: 8

Davide Passaretti
Davide Passaretti

Reputation: 2761

Using a combination of apply and tapply:

t(apply(df, 1, function(x) tapply(x, colnames(df), mean)))

#        A B   C
# [1,] 3.0 3 4.5
# [2,] 3.5 6 4.5
# [3,] 4.0 3 4.0
# [4,] 2.0 2 2.0

Upvotes: 6

joran
joran

Reputation: 173727

This works but isn't as nice as gegolews solution, in my opinion:

x <- read.table(text = "A B C B A C                             
 1 2 3 4 5 6         
 3 5 6 7 4 3                             
 3 3 3 3 5 5                             
 2 2 2 2 2 2",header = TRUE,sep = "",check.names = FALSE)

as.data.frame(lapply(split(as.list(x),f = colnames(x)),function(x) Reduce(`+`,x) / length(x)))
##    A B   C
##1 3.0 3 4.5
##2 3.5 6 4.5
##3 4.0 3 4.0
##4 2.0 2 2.0

Upvotes: 6

Related Questions