Reputation: 499
I have a data frame with a series of columns, some column names have the exact same name as others, but with an incremented numerical value inserted somewhere in the string (though the incremental values are random). To make things more confusing, column names can have numbers in them that are not part of the incremented system. I would like to match columns that are exactly the same, except for the incremented number, and then calculate the rowMeans for these columns. Here is an example data frame:
A2<-c(1,2,2,3)
B3<-c(3,4,3,2)
st<-c(1,2,2,3)
ba<-c(3,4,3,2)
Red1bax<-c(3,4,5,6)
Red2bax<-c(2,3,4,NA)
Red7bax<-c(3,4,5,6)
Red1str<-c(3,4,5,6)
Red2str<-c(2,3,4,5)
Red9str<-c(3,4,5,6)
N34_22str<-c(3,4,2,6)
N34_303str<-c(2,4,2,6)
DF<-data.frame(A2,B3,st,ba,N34_303str, Red1str,Red2str, Red7bax, Red2bax, Red9str, Red1bax, N34_22str)
My goal is to calculate rowMeans for the matched columns. In this case, the columns that match contain Red?bax, Red?str, and N34_?str (with the ? being the incremented number). So ideally I would get the mean values for these cols, and then drop them from the dataset. The end result would look like this:
DF$Redbax<-rowMeans(DF[,c('Red1bax','Red2bax','Red7bax')],na.rm=TRUE)
DF$Redstr<-rowMeans(DF[,c('Red1str','Red2str','Red9str')], na.rm=TRUE)
DF$N34<-rowMeans(DF[,c('N34_22str','N34_303str')], na.rm=TRUE)
DF<-DF[,-c(5:12)]
Upvotes: 0
Views: 136
Reputation: 389012
I don't know if this is your real data or your real data is different. In general, if you want to do this in an automated way, we need to come up with a pattern that uniquely identifies each set of columns.
For the example shared you may use this pattern -
cols <- 5:12
gsub('_.*|\\d+', '', names(DF)[cols])
#[1] "N" "Redstr" "Redstr" "Redbax" "Redbax" "Redstr" "Redbax" "N"
You can use these values as groups to split on and use it in split.default
, take rowwise mean of each group and bind the data to original columns.
cbind(DF[-cols],
sapply(split.default(DF[cols],
gsub('_.*|\\d+', '', names(DF)[cols])), rowMeans, na.rm = TRUE))
# A2 B3 st ba N Redbax Redstr
#1 1 3 1 3 2.5 2.666667 2.666667
#2 2 4 2 4 4.0 3.666667 3.666667
#3 2 3 2 3 2.0 4.666667 4.666667
#4 3 2 3 2 6.0 6.000000 5.666667
Upvotes: 1
Reputation: 27732
a data.table
approach using .SDcols
with a regex-patterns for column selection
library(data.table)
setDT(DF)
DF[, Redbax := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^Red.*bax$")]
DF[, Redstr := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^Red.*str$")]
DF[, N34 := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^N34.*str$")]
# A2 B3 st ba N34_303str Red1str Red2str Red7bax Red2bax Red9str Red1bax N34_22str Redbax Redstr N34
# 1: 1 3 1 3 2 3 2 3 2 3 3 3 2.666667 2.666667 2.5
# 2: 2 4 2 4 4 4 3 4 3 4 4 4 3.666667 3.666667 4.0
# 3: 2 3 2 3 2 5 4 5 4 5 5 2 4.666667 4.666667 2.0
# 4: 3 2 3 2 6 6 5 6 NA 6 6 6 6.000000 5.666667 6.0
Upvotes: 1