Kevin T
Kevin T

Reputation: 151

R Mutate Column That Represents the Mean of Many Other Columns

I am trying to add a column in my DataFrame that represents the mean of many other columns (items that represent a single construct).

The dataframe has many other columns but particularly columns eng1, eng2, eng3...engN where N is a large number and I want to take the mean of all the eng* columns and add that mean as a new column to my dataset.

I was able to do this with the following code:

narrow_ds # ... initialization of dataframe
library(dplyr)
narrow_ds <- bind_cols(narrow_ds, (narrow_ds %>% 
select(starts_with("eng")) %>% mutate(eng=rowMeans(., na.rm=TRUE))) %>% 
select(eng))

It seems that having the na.rm=TRUE requirement forced me to jump through some hoops.

My question is whether there are more straightforward ways to do this?

Upvotes: 5

Views: 5021

Answers (5)

After struggling with the same issue, I think the easiest way to make operations (mean, sd, sums, etc) whitn colums is by useing "rowwise()" comand from "dplyr", and grouping target columns with "c()" inside the wanted operation:

C1<-c(3,2,4,4,5)
C2<-c(3,7,3,4,5)
C3<-c(5,4,3,6,3)
DF<-data.frame(ID=c("A","B","C","D","E"),C1=C1,C2=C2,C3=C3)


library(dplyr)

DF <- DF %>% 
     rowwise() %>% mutate(Avg = (mean(c(C1, C2, C3), na.rm = T))) 

DF
   ID    C1    C2    C3    Avg
  <fct> <dbl> <dbl> <dbl> <dbl>
1 A         3     3     5  3.67
2 B         2     7     4  4.33
3 C         4     3     3  3.33
4 D         4     4     6  4.67
5 E         5     5     3  4.33

Also one can ensure that this operation ends on a single data frame doing the following:

DF <- as.data.frame(DF %>% 
         rowwise() %>% mutate(Avg = (mean(c(C1, C2, C3), na.rm = T)))) 

Upvotes: 2

AkselA
AkselA

Reputation: 8856

# Example data
set.seed(1)
dtf <- as.data.frame(matrix(sample(1:10000), ncol=1000))
nam <- replicate(ncol(dtf), paste(sample(LETTERS, 10, replace=TRUE), collapse=""))
colnames(dtf) <- nam

I didn't name the columns 'eng*', but this is functionally the same.
Only the columns whose name starts with an A will be averaged.

eng <- rowMeans(dtf[, grep("^A", colnames(dtf))], na.rm=TRUE)
dtf <- cbind(eng, dtf)

summary(dtf)[,1:4]
# #      eng         BRTCBDWVWA     COCSSUQNLA     FIOULRNUXL  
# Min.   :4535   Min.   : 618   Min.   :1764   Min.   : 134  
# 1st Qu.:4780   1st Qu.:2922   1st Qu.:3805   1st Qu.:2254  
# Median :5187   Median :6008   Median :5916   Median :3604  
# Mean   :5107   Mean   :5513   Mean   :5580   Mean   :4174  
# 3rd Qu.:5337   3rd Qu.:8386   3rd Qu.:7557   3rd Qu.:5840  
# Max.   :5739   Max.   :9442   Max.   :9903   Max.   :9329  

Upvotes: 1

MKR
MKR

Reputation: 20095

You are very much in right direction. You can avoid bind_cols with a tweak in your code. Moreover, the NA are supported even in rowMeans. I have modified the sample data used by @Tung to include few NAs as well. The solutions can be as:

Option#1: Using dplyr in similar approach as OP.

library(dplyr)
DF %>% mutate(eng = rowMeans(select(.,starts_with("eng")), na.rm = TRUE))

# # A tibble: 4 x 5
#      id  eng1  eng2  eng3   eng
#   <int> <dbl> <dbl> <dbl> <dbl>
# 1     1  50.0    NA  20.0  35.0
# 2     2  NA     100  10.0  55.0
# 3     3  20.0   150  80.0  83.3
# 4     4  30.0   200  40.0  90.0

Option#2: Using apply

DF$eng <- apply(DF[,grep("eng",names(DF))], 1, mean, na.rm = TRUE)

DF
# # A tibble: 4 x 5
#      id  eng1  eng2  eng3   eng
#    <int> <dbl> <dbl> <dbl> <dbl>
# 1     1  50.0    NA  20.0  35.0
# 2     2  NA     100  10.0  55.0
# 3     3  20.0   150  80.0  83.3
# 4     4  30.0   200  40.0  90.0

Sample data:

DF = data_frame(id = 1:4,
                eng1 = c(50, NA, 20, 30), 
                eng2 = c(NA, 100, 150, 200), 
                eng3 = c(20, 10, 80, 40))

Upvotes: 7

moodymudskipper
moodymudskipper

Reputation: 47350

Stealing sample data from @MKR, in base R:

DF$eng <- rowMeans(DF[startsWith(names(DF),"eng")], na.rm = TRUE)

# # A tibble: 4 x 5
#      id  eng1  eng2  eng3      eng
#   <int> <dbl> <dbl> <dbl>    <dbl>
# 1     1    50    NA    20 35.00000
# 2     2    NA   100    10 55.00000
# 3     3    20   150    80 83.33333
# 4     4    30   200    40 90.00000

Upvotes: 5

Tung
Tung

Reputation: 28441

You can use pmap_dbl to loop through every row of your data frame. Here is an example with some random data

library(tidyverse)

DF = data_frame(eng1 = c(50, 40, 20, 30), 
                eng2 = c(130, 100, 150, 200), 
                eng3 = c(20, 10, 80, 40))
DF

#> # A tibble: 4 x 3
#>    eng1  eng2  eng3
#>   <dbl> <dbl> <dbl>
#> 1    50   130    20
#> 2    40   100    10
#> 3    20   150    80
#> 4    30   200    40

DF %>% 
  select(starts_with("eng")) %>% 
  mutate(eng = pmap_dbl(., sum))

#> # A tibble: 4 x 4
#>    eng1  eng2  eng3   eng
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    50   130    20   200
#> 2    40   100    10   150
#> 3    20   150    80   250
#> 4    30   200    40   270

# or define a function
sumAll <- function(...) {
  x = c(...)
  sum(x, na.rm = TRUE)
}

DF %>% 
  select(starts_with("eng")) %>% 
  mutate(eng = pmap_dbl(., sumAll))

#> # A tibble: 4 x 4
#>    eng1  eng2  eng3   eng
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    50   130    20   200
#> 2    40   100    10   150
#> 3    20   150    80   250
#> 4    30   200    40   270

Created on 2018-05-22 by the reprex package (v0.2.0).

Upvotes: 1

Related Questions