Match column names by sequential numeric pattern, and calculate rowMeans for all matches

Question

I have a data frame with a series of columns, some column names have the exact same name as others, but with an incremented numerical value inserted somewhere in the string (though the incremental values are random). To make things more confusing, column names can have numbers in them that are not part of the incremented system. I would like to match columns that are exactly the same, except for the incremented number, and then calculate the rowMeans for these columns. Here is an example data frame:

A2<-c(1,2,2,3)
B3<-c(3,4,3,2)
st<-c(1,2,2,3)
ba<-c(3,4,3,2)
Red1bax<-c(3,4,5,6)
Red2bax<-c(2,3,4,NA)
Red7bax<-c(3,4,5,6)
Red1str<-c(3,4,5,6)
Red2str<-c(2,3,4,5)
Red9str<-c(3,4,5,6)
N34_22str<-c(3,4,2,6)
N34_303str<-c(2,4,2,6)

DF<-data.frame(A2,B3,st,ba,N34_303str, Red1str,Red2str, Red7bax, Red2bax, Red9str, Red1bax, N34_22str)

My goal is to calculate rowMeans for the matched columns. In this case, the columns that match contain Red?bax, Red?str, and N34_?str (with the ? being the incremented number). So ideally I would get the mean values for these cols, and then drop them from the dataset. The end result would look like this:

DF$Redbax<-rowMeans(DF[,c('Red1bax','Red2bax','Red7bax')],na.rm=TRUE)
DF$Redstr<-rowMeans(DF[,c('Red1str','Red2str','Red9str')], na.rm=TRUE)
DF$N34<-rowMeans(DF[,c('N34_22str','N34_303str')], na.rm=TRUE)

DF<-DF[,-c(5:12)]

Wimpel · Accepted Answer

a data.table approach using .SDcols with a regex-patterns for column selection

library(data.table)
setDT(DF)
DF[, Redbax := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^Red.*bax$")]
DF[, Redstr := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^Red.*str$")]
DF[, N34 := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^N34.*str$")]
#    A2 B3 st ba N34_303str Red1str Red2str Red7bax Red2bax Red9str Red1bax N34_22str   Redbax   Redstr N34
# 1:  1  3  1  3          2       3       2       3       2       3       3         3 2.666667 2.666667 2.5
# 2:  2  4  2  4          4       4       3       4       3       4       4         4 3.666667 3.666667 4.0
# 3:  2  3  2  3          2       5       4       5       4       5       5         2 4.666667 4.666667 2.0
# 4:  3  2  3  2          6       6       5       6      NA       6       6         6 6.000000 5.666667 6.0

Match column names by sequential numeric pattern, and calculate rowMeans for all matches

Answers (2)

Related Questions