Vint
Vint

Reputation: 499

Match column names by sequential numeric pattern, and calculate rowMeans for all matches

I have a data frame with a series of columns, some column names have the exact same name as others, but with an incremented numerical value inserted somewhere in the string (though the incremental values are random). To make things more confusing, column names can have numbers in them that are not part of the incremented system. I would like to match columns that are exactly the same, except for the incremented number, and then calculate the rowMeans for these columns. Here is an example data frame:

A2<-c(1,2,2,3)
B3<-c(3,4,3,2)
st<-c(1,2,2,3)
ba<-c(3,4,3,2)
Red1bax<-c(3,4,5,6)
Red2bax<-c(2,3,4,NA)
Red7bax<-c(3,4,5,6)
Red1str<-c(3,4,5,6)
Red2str<-c(2,3,4,5)
Red9str<-c(3,4,5,6)
N34_22str<-c(3,4,2,6)
N34_303str<-c(2,4,2,6)

DF<-data.frame(A2,B3,st,ba,N34_303str, Red1str,Red2str, Red7bax, Red2bax, Red9str, Red1bax, N34_22str)

My goal is to calculate rowMeans for the matched columns. In this case, the columns that match contain Red?bax, Red?str, and N34_?str (with the ? being the incremented number). So ideally I would get the mean values for these cols, and then drop them from the dataset. The end result would look like this:

DF$Redbax<-rowMeans(DF[,c('Red1bax','Red2bax','Red7bax')],na.rm=TRUE)
DF$Redstr<-rowMeans(DF[,c('Red1str','Red2str','Red9str')], na.rm=TRUE)
DF$N34<-rowMeans(DF[,c('N34_22str','N34_303str')], na.rm=TRUE)

DF<-DF[,-c(5:12)]

Upvotes: 0

Views: 136

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389012

I don't know if this is your real data or your real data is different. In general, if you want to do this in an automated way, we need to come up with a pattern that uniquely identifies each set of columns.

For the example shared you may use this pattern -

cols <- 5:12
gsub('_.*|\\d+', '', names(DF)[cols])
#[1] "N"      "Redstr" "Redstr" "Redbax" "Redbax" "Redstr" "Redbax" "N"   

You can use these values as groups to split on and use it in split.default, take rowwise mean of each group and bind the data to original columns.

cbind(DF[-cols], 
      sapply(split.default(DF[cols], 
             gsub('_.*|\\d+', '', names(DF)[cols])), rowMeans, na.rm = TRUE))

#  A2 B3 st ba   N   Redbax   Redstr
#1  1  3  1  3 2.5 2.666667 2.666667
#2  2  4  2  4 4.0 3.666667 3.666667
#3  2  3  2  3 2.0 4.666667 4.666667
#4  3  2  3  2 6.0 6.000000 5.666667

Upvotes: 1

Wimpel
Wimpel

Reputation: 27732

a data.table approach using .SDcols with a regex-patterns for column selection

library(data.table)
setDT(DF)
DF[, Redbax := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^Red.*bax$")]
DF[, Redstr := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^Red.*str$")]
DF[, N34 := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^N34.*str$")]
#    A2 B3 st ba N34_303str Red1str Red2str Red7bax Red2bax Red9str Red1bax N34_22str   Redbax   Redstr N34
# 1:  1  3  1  3          2       3       2       3       2       3       3         3 2.666667 2.666667 2.5
# 2:  2  4  2  4          4       4       3       4       3       4       4         4 3.666667 3.666667 4.0
# 3:  2  3  2  3          2       5       4       5       4       5       5         2 4.666667 4.666667 2.0
# 4:  3  2  3  2          6       6       5       6      NA       6       6         6 6.000000 5.666667 6.0

Upvotes: 1

Related Questions