Reputation: 111
I have a data frame consisting of 10299 observations of 66 variables. Some of these variables share a common column name, and I would like to calculate the mean of those variables for each observation.
Having the following matrix, with column names c(A, B, C, B, A ,C)
:
A B C B A C
1 2 3 4 5 6
3 5 6 7 4 3
3 3 3 3 5 5
2 2 2 2 2 2
I would like to get:
A B C
3 3 4.5
3.5 6 4.5
4 3 4
2 2 2
I tried for loops, the command aggregate()
but I don't get the desired result.
Sorry if the question seems too basic, I have checked google for possible solutions but I didn't find any.
Upvotes: 4
Views: 6370
Reputation: 32558
You can use split.default
to divide the dataframe into different groups based in column names and then use rowMeans
to get the average of columns with same name.
sapply(split.default(df, names(df)), rowMeans)
# A B C
#[1,] 3.0 3 4.5
#[2,] 3.5 6 4.5
#[3,] 4.0 3 4.0
#[4,] 2.0 2 2.0
DATA
df = structure(list(A = c(1L, 3L, 3L, 2L), B = c(2L, 5L, 3L, 2L),
C = c(3L, 6L, 3L, 2L), B = c(4L, 7L, 3L, 2L), A = c(5L, 4L,
5L, 2L), C = c(6L, 3L, 5L, 2L)), .Names = c("A", "B", "C",
"B", "A", "C"), class = "data.frame", row.names = c(NA, -4L))
Upvotes: 1
Reputation: 13076
Here's a solution.
First let's define an exemplary data.frame (the same as in your example).
df <- as.data.frame(
matrix(c(1,3,3,2,2,5,3,2,3,6,3,2,4,7,3,2,5,4,5,2,6,3,5,2),
ncol=6,
dimnames=list(NULL, c("A", "B", "C", "B", "A", "C"))
)
)
Below we apply a custom function on each unique column name, col
:
it selects all the columns named col
and calculates rowMeans
. The result, list of atomic vectors, will be coerced to a data.frame:
res <- as.data.frame( # sapply returns a list here, so we convert it to a data.frame
sapply(unique(names(df)), # for each unique column name
function(col) rowMeans(df[names(df) == col]) # calculate row means
)
)
The result:
res
## A B C
## 1 3.0 3 4.5
## 2 3.5 6 4.5
## 3 4.0 3 4.0
## 4 2.0 2 2.0
EDIT: As there are many solutions proposed already, let's benchmark them:
set.seed(123)
df <- as.data.frame(matrix(sample(1:9, replace=TRUE, 10000*100),
dimnames=list(NULL, sample(LETTERS[1:5], 100, replace=TRUE)), ncol=100))
library(microbenchmark)
microbenchmark(...)
## Unit: milliseconds
## min lq median uq max neval
## @gagolews 61.196075 65.73211 77.22533 119.42028 127.32557 10
## @joran 8.297964 10.05242 10.90564 15.25943 65.69156 10
## @Davide 5535.272680 5731.24220 5754.67006 5808.47807 5862.22628 10
The clear winner (at least as far as speed is concerned) is @joran's lapply
+split
+Reduce
. Congrats! :-)
Upvotes: 8
Reputation: 2761
Using a combination of apply
and tapply
:
t(apply(df, 1, function(x) tapply(x, colnames(df), mean)))
# A B C
# [1,] 3.0 3 4.5
# [2,] 3.5 6 4.5
# [3,] 4.0 3 4.0
# [4,] 2.0 2 2.0
Upvotes: 6
Reputation: 173727
This works but isn't as nice as gegolews solution, in my opinion:
x <- read.table(text = "A B C B A C
1 2 3 4 5 6
3 5 6 7 4 3
3 3 3 3 5 5
2 2 2 2 2 2",header = TRUE,sep = "",check.names = FALSE)
as.data.frame(lapply(split(as.list(x),f = colnames(x)),function(x) Reduce(`+`,x) / length(x)))
## A B C
##1 3.0 3 4.5
##2 3.5 6 4.5
##3 4.0 3 4.0
##4 2.0 2 2.0
Upvotes: 6