Reputation: 143
I frequently want to perform functions on a set or range of columns in a dataframe. Most commonly, I want to take the mean of a range of columns that share a common prefix (in this toy example, VarA.
, VarB.
, and VarC.
:
ID<-c(1:300) #participant ID column, N=300
VarA.1<-sample(x = 0:50,size = 300, replace = TRUE)
VarA.2<-sample(x = 0:50,size = 300, replace = TRUE)
VarA.3<-sample(x = 0:50,size = 300, replace = TRUE)
VarB.1<-sample(x = 0:30,size = 300, replace = TRUE)
VarB.2<-sample(x = 0:30,size = 300, replace = TRUE)
VarB.3<-sample(x = 0:30,size = 300, replace = TRUE)
VarC.1<-sample(x = 0:10,size = 300, replace = TRUE)
VarC.2<-sample(x = 0:10,size = 300, replace = TRUE)
VarC.3<-sample(x = 0:10,size = 300, replace = TRUE)
df<-data.frame(ID,VarA.1,VarA.2,VarA.3,
VarB.1,VarB.2,VarB.3,
VarC.1,VarC.2,VarC.3)
rm(ID,VarA.1,VarA.2,VarA.3,
VarB.1,VarB.2,VarB.3,
VarC.1,VarC.2,VarC.3)
I usually have a ton of variables, so I can't memorize the column numbers. Let's say I want to take the average of all columns starting with VarA.
and put it in a new column called VarA
. Here is my usual approach:
x<-which(colnames(df)=="VarA.1")
y<-which(colnames(df)=="VarA.3")
df$VarA<-rowMeans(df[, c(x:y)])
Maybe I'm being too picky, but given that I have to do this (or something very similar) upwards of 20 times in some scripts, it just looks really messy and clunky, and it's hard to remember: I have to dig up a previous file and then copy and paste and carefully change all the values to fit my current dataset. I'd really like to make this into a function, but I'm not very familiar with user-defined functions and I'm having trouble figuring out how to deal with multiple variables.
The approach I tried was:
colmeans <- function(x,y,df,meancol) {
first<-which(colnames(df)==x)
last<-which(colnames(df)==y)
df$meancol<-rowMeans(df[, c(first:last)])
}
colmeans("VarA.1","VarA.3",df,"VarA")
I could have sworn it worked at one point but I lost it and I can't remember what I changed. What am I missing?
I'm also open to other ideas about how to make this process more efficient.
Upvotes: 1
Views: 54
Reputation: 79164
Are you looking for such a solution?
library(dplyr)
df %>%
mutate(across(starts_with("VarA"), mean, .names = "mean_{.col}"))
Output:
+ head()
ID VarA.1 VarA.2 VarA.3 VarB.1 VarB.2 VarB.3 VarC.1 VarC.2 VarC.3 mean_VarA.1 mean_VarA.2
1 1 41 37 7 0 14 1 10 7 0 23.97667 24.73667
2 2 50 36 20 16 29 7 0 10 9 23.97667 24.73667
3 3 2 5 43 20 24 9 5 8 8 23.97667 24.73667
4 4 9 39 41 15 21 5 9 6 3 23.97667 24.73667
5 5 38 25 37 20 19 24 6 5 4 23.97667 24.73667
6 6 12 27 47 28 14 14 10 5 2 23.97667 24.73667
mean_VarA.3
1 26.16667
2 26.16667
3 26.16667
4 26.16667
5 26.16667
6 26.16667
Upvotes: 2
Reputation: 887531
We can use split.default
lst1 <- lapply(split.default(df[-1], sub("\\.\\d+$", "", names(df)[-1])),
rowMeans, na.rm = TRUE)
df[paste0(names(lst1), "_Mean")] <- lst1
-output
head(df, 3)
ID VarA.1 VarA.2 VarA.3 VarB.1 VarB.2 VarB.3 VarC.1 VarC.2 VarC.3 VarA_Mean VarB_Mean VarC_Mean
1 1 25 40 27 6 9 2 3 1 0 30.66667 5.666667 1.333333
2 2 43 16 26 27 7 5 2 10 5 28.33333 13.000000 5.666667
3 3 14 34 38 9 10 7 3 9 9 28.66667 8.666667 7.000000
Or using tidyverse
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ID, names_to = ".value",
names_pattern = "^([^.]+)\\.\\d+") %>%
group_by(ID) %>%
summarise(across(everything(), mean, na.rm = TRUE,
.names = "{.col}_Mean"), .groups = 'drop') %>%
left_join(df)
Upvotes: 2