Reputation: 151
I am trying to add a column in my DataFrame that represents the mean of many other columns (items that represent a single construct).
The dataframe has many other columns but particularly columns eng1, eng2, eng3...engN where N is a large number and I want to take the mean of all the eng* columns and add that mean as a new column to my dataset.
I was able to do this with the following code:
narrow_ds # ... initialization of dataframe
library(dplyr)
narrow_ds <- bind_cols(narrow_ds, (narrow_ds %>%
select(starts_with("eng")) %>% mutate(eng=rowMeans(., na.rm=TRUE))) %>%
select(eng))
It seems that having the na.rm=TRUE requirement forced me to jump through some hoops.
My question is whether there are more straightforward ways to do this?
Upvotes: 5
Views: 5021
Reputation: 21
After struggling with the same issue, I think the easiest way to make operations (mean, sd, sums, etc) whitn colums is by useing "rowwise()" comand from "dplyr", and grouping target columns with "c()" inside the wanted operation:
C1<-c(3,2,4,4,5)
C2<-c(3,7,3,4,5)
C3<-c(5,4,3,6,3)
DF<-data.frame(ID=c("A","B","C","D","E"),C1=C1,C2=C2,C3=C3)
library(dplyr)
DF <- DF %>%
rowwise() %>% mutate(Avg = (mean(c(C1, C2, C3), na.rm = T)))
DF
ID C1 C2 C3 Avg
<fct> <dbl> <dbl> <dbl> <dbl>
1 A 3 3 5 3.67
2 B 2 7 4 4.33
3 C 4 3 3 3.33
4 D 4 4 6 4.67
5 E 5 5 3 4.33
Also one can ensure that this operation ends on a single data frame doing the following:
DF <- as.data.frame(DF %>%
rowwise() %>% mutate(Avg = (mean(c(C1, C2, C3), na.rm = T))))
Upvotes: 2
Reputation: 8856
# Example data
set.seed(1)
dtf <- as.data.frame(matrix(sample(1:10000), ncol=1000))
nam <- replicate(ncol(dtf), paste(sample(LETTERS, 10, replace=TRUE), collapse=""))
colnames(dtf) <- nam
I didn't name the columns 'eng*', but this is functionally the same.
Only the columns whose name starts with an A
will be averaged.
eng <- rowMeans(dtf[, grep("^A", colnames(dtf))], na.rm=TRUE)
dtf <- cbind(eng, dtf)
summary(dtf)[,1:4]
# # eng BRTCBDWVWA COCSSUQNLA FIOULRNUXL
# Min. :4535 Min. : 618 Min. :1764 Min. : 134
# 1st Qu.:4780 1st Qu.:2922 1st Qu.:3805 1st Qu.:2254
# Median :5187 Median :6008 Median :5916 Median :3604
# Mean :5107 Mean :5513 Mean :5580 Mean :4174
# 3rd Qu.:5337 3rd Qu.:8386 3rd Qu.:7557 3rd Qu.:5840
# Max. :5739 Max. :9442 Max. :9903 Max. :9329
Upvotes: 1
Reputation: 20095
You are very much in right direction. You can avoid bind_cols
with a tweak in your code. Moreover, the NA
are supported even in rowMeans
. I have modified the sample data used by @Tung
to include few NAs
as well. The solutions can be as:
Option#1: Using dplyr
in similar approach as OP.
library(dplyr)
DF %>% mutate(eng = rowMeans(select(.,starts_with("eng")), na.rm = TRUE))
# # A tibble: 4 x 5
# id eng1 eng2 eng3 eng
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 50.0 NA 20.0 35.0
# 2 2 NA 100 10.0 55.0
# 3 3 20.0 150 80.0 83.3
# 4 4 30.0 200 40.0 90.0
Option#2: Using apply
DF$eng <- apply(DF[,grep("eng",names(DF))], 1, mean, na.rm = TRUE)
DF
# # A tibble: 4 x 5
# id eng1 eng2 eng3 eng
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 50.0 NA 20.0 35.0
# 2 2 NA 100 10.0 55.0
# 3 3 20.0 150 80.0 83.3
# 4 4 30.0 200 40.0 90.0
Sample data:
DF = data_frame(id = 1:4,
eng1 = c(50, NA, 20, 30),
eng2 = c(NA, 100, 150, 200),
eng3 = c(20, 10, 80, 40))
Upvotes: 7
Reputation: 47350
Stealing sample data from @MKR, in base R:
DF$eng <- rowMeans(DF[startsWith(names(DF),"eng")], na.rm = TRUE)
# # A tibble: 4 x 5
# id eng1 eng2 eng3 eng
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 50 NA 20 35.00000
# 2 2 NA 100 10 55.00000
# 3 3 20 150 80 83.33333
# 4 4 30 200 40 90.00000
Upvotes: 5
Reputation: 28441
You can use pmap_dbl
to loop through every row of your data frame. Here is an example with some random data
library(tidyverse)
DF = data_frame(eng1 = c(50, 40, 20, 30),
eng2 = c(130, 100, 150, 200),
eng3 = c(20, 10, 80, 40))
DF
#> # A tibble: 4 x 3
#> eng1 eng2 eng3
#> <dbl> <dbl> <dbl>
#> 1 50 130 20
#> 2 40 100 10
#> 3 20 150 80
#> 4 30 200 40
DF %>%
select(starts_with("eng")) %>%
mutate(eng = pmap_dbl(., sum))
#> # A tibble: 4 x 4
#> eng1 eng2 eng3 eng
#> <dbl> <dbl> <dbl> <dbl>
#> 1 50 130 20 200
#> 2 40 100 10 150
#> 3 20 150 80 250
#> 4 30 200 40 270
# or define a function
sumAll <- function(...) {
x = c(...)
sum(x, na.rm = TRUE)
}
DF %>%
select(starts_with("eng")) %>%
mutate(eng = pmap_dbl(., sumAll))
#> # A tibble: 4 x 4
#> eng1 eng2 eng3 eng
#> <dbl> <dbl> <dbl> <dbl>
#> 1 50 130 20 200
#> 2 40 100 10 150
#> 3 20 150 80 250
#> 4 30 200 40 270
Created on 2018-05-22 by the reprex package (v0.2.0).
Upvotes: 1