Reputation: 61
Working in a data frame, I want to create a new column using mutate that averages all columns in each row together, besides one, based on column name. I need to be able to exclude a certain column in each use of mutate, and I'd like the calculation to skip over NA values as well.
Simple version of my DF:
Team stat1 stat2 stat3 stat4
1 ARI 3 NA 4 6
2 BAL NA 2 NA 1
3 CAR 5 4 6 2
NewCol1 created from calculating mean of stat columns, excluding 'stat 1' column and NA values. Same thing done for NewCol2, calculated mean excludes the 'stat2' column:
Team stat1 stat2 stat3 stat4 NewCol1 NewCol2
1 ARI 3 NA 4 6 5.0 4.33
2 BAL NA 2 NA 1 1.5 1.00
3 CAR 5 4 6 2 4.0 4.33
What would be the most efficient way to do this if I want to create new columns that do the same thing for each stat? The DF has 10 stat columns, each with the same name and then a number after each name. I was thinking the starts_with() function might be of use here with rowMeans(), but struggling with how I'd implement that while also excluding a certain column each time.
Upvotes: 1
Views: 938
Reputation: 887851
We can use rowMeans
after select
ing out the relevant columns
library(dplyr)
df1 %>%
mutate(NewCol1 = rowMeans(select(., -Team, -stat1), na.rm = TRUE),
NewCol2 = rowMeans(select(., -Team, -stat2), na.rm = TRUE))
-output
# Team stat1 stat2 stat3 stat4 NewCol1 NewCol2
#1 ARI 3 NA 4 6 5.0 4.333333
#2 BAL NA 2 NA 1 1.5 1.000000
#3 CAR 5 4 6 2 4.0 4.333333
Or another option with c_across
df1 %>%
rowwise %>%
mutate(NewCol1 = mean(c_across(c(where(is.numeric), -stat1)), na.rm = TRUE),
NewCol2 = mean(c_across(c(starts_with('stat'), -stat2)), na.rm = TRUE),
NewCol3 = mean(c_across(c(starts_with('stat'), -stat3)), na.rm = TRUE),
NewCol4 = mean(c_across(c(starts_with('stat'), -stat4)), na.rm = TRUE)) %>%
ungroup
-output
# A tibble: 3 x 9
# Team stat1 stat2 stat3 stat4 NewCol1 NewCol2 NewCol3 NewCol4
# <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#1 ARI 3 NA 4 6 5 4.33 4.5 3.5
#2 BAL NA 2 NA 1 1.5 1 1.5 2
#3 CAR 5 4 6 2 4 4.33 3.67 5
If we want to do this automatically, an option is
library(purrr)
df1[paste0("NewCol", 1:2)] <- map(c('stat1', 'stat2'),
~ df1 %>%
select(starts_with('stat'), -.x) %>%
rowMeans(., na.rm = TRUE))
Or to create columns 1 to 4
nm1 <- names(df1)[startsWith(names(df1), 'stat')]
df1[paste0("NewCol", seq_along(nm1))] <- map(nm1,
~ df1 %>%
select(starts_with('stat'), -.x) %>%
rowMeans(., na.rm = TRUE))
-output
df1
# Team stat1 stat2 stat3 stat4 NewCol1 NewCol2 NewCol3 NewCol4
#1 ARI 3 NA 4 6 5.0 4.333333 4.500000 3.5
#2 BAL NA 2 NA 1 1.5 1.000000 1.500000 2.0
#3 CAR 5 4 6 2 4.0 4.333333 3.666667 5.0
Or to do this fully in tidyverse
library(stringr)
map_dfc(nm1, ~
df1 %>%
select(starts_with('stat'), -.x) %>%
transmute(!! str_c('NewCol', readr::parse_number(.x)) :=
rowMeans(., na.rm = TRUE))) %>%
bind_cols(df1, .)
# Team stat1 stat2 stat3 stat4 NewCol1 NewCol2 NewCol3 NewCol4
#1 ARI 3 NA 4 6 5.0 4.333333 4.500000 3.5
#2 BAL NA 2 NA 1 1.5 1.000000 1.500000 2.0
#3 CAR 5 4 6 2 4.0 4.333333 3.666667 5.0
Or using rowwise/c_across
map_dfc(nm1, ~
df1 %>%
select(starts_with('stat'), -.x) %>% rowwise %>%
transmute(!! str_c('NewCol', readr::parse_number(.x)) := mean(c_across(everything()), na.rm = TRUE))) %>%
ungroup %>%
bind_cols(df1, .)
-output
# Team stat1 stat2 stat3 stat4 NewCol1 NewCol2 NewCol3 NewCol4
#1 ARI 3 NA 4 6 5.0 4.333333 4.500000 3.5
#2 BAL NA 2 NA 1 1.5 1.000000 1.500000 2.0
#3 CAR 5 4 6 2 4.0 4.333333 3.666667 5.0
Or using base R
df1[paste0("NewCol", seq_along(nm1))] <- lapply(nm1,
function(x) rowMeans(df1[setdiff(names(df1)[-1], x)], na.rm = TRUE))
df1 <- structure(list(Team = c("ARI", "BAL", "CAR"), stat1 = c(3L, NA,
5L), stat2 = c(NA, 2L, 4L), stat3 = c(4L, NA, 6L), stat4 = c(6L,
1L, 2L)), class = "data.frame", row.names = c("1", "2", "3"))
Upvotes: 2
Reputation: 389235
In base R, you can find the columns which has 'stat'
in it and one by one remove it from lapply
and take row-wise mean of it.
cols <- grep('stat', names(df))
new_cols <- paste0('remove_', names(df)[cols])
df[new_cols] <- lapply(cols, function(x) rowMeans(df[, -c(1, x)], na.rm = TRUE))
df
# Team stat1 stat2 stat3 stat4 remove_stat1 remove_stat2 remove_stat3 remove_stat4
#1 ARI 3 NA 4 6 5.0 4.333333 4.500000 3.5
#2 BAL NA 2 NA 1 1.5 1.000000 1.500000 2.0
#3 CAR 5 4 6 2 4.0 4.333333 3.666667 5.0
Upvotes: 2