Reputation: 3941
I am starting with 3 large data tables (named A1,A2,A3). Each table has 4 data columns (V1-V4), 1 "Date" column that is constant across all three tables, and thousands of rows.
Here is some dummy data that approximates my tables.
A1.V1<-c(1,2,3,4)
A1.V2<-c(2,4,6,8)
A1.V3<-c(1,3,5,7)
A1.V4<-c(1,2,3,4)
A2.V1<-c(1,2,3,4)
A2.V2<-c(2,4,6,8)
A2.V3<-c(1,3,5,7)
A2.V4<-c(1,2,3,4)
A3.V1<-c(1,2,3,4)
A3.V2<-c(2,4,6,8)
A3.V3<-c(1,3,5,7)
A3.V4<-c(1,2,3,4)
Date<-c(2001,2002,2003,2004)
DF<-data.frame(Date, A1.V1,A1.V2,A1.V3,A1.V4,A2.V1,A2.V2,A2.V3,A2.V4,A3.V1,A3.V2,A3.V3,A3.V4)
So this is what my data frame ends up looking like:
Date A1.V1 A1.V2 A1.V3 A1.V4 A2.V1 A2.V2 A2.V3 A2.V4 A3.V1 A3.V2 A3.V3 A3.V4
1 2001 1 2 1 1 1 2 1 1 1 2 1 1
2 2002 2 4 3 2 2 4 3 2 2 4 3 2
3 2003 3 6 5 3 3 6 5 3 3 6 5 3
4 2004 4 8 7 4 4 8 7 4 4 8 7 4
My goal is to calculate the row mean for each of the matching columns from each data table. So in this instance, I would want row means for all columns ending in V1, all columns ending in V2, all columns ending in V3 and all columns ending in V4.
The end result would look like this
V1 V2 V3 V4
2001 1 2 1 1
2002 2 4 3 2
2003 3 6 5 3
2004 4 8 7 4
So my question is, how to I go about calculating row means based on a partial match in the column name?
Thanks
Upvotes: 3
Views: 5586
Reputation: 263411
colnames = c("V1", "V2", "V3", "V4")
res <- sapply(colnames, function(x) rowMeans(DF [, grep(x, names(DF))] ) )
rownames(res) <- DF$Date
res
V1 V2 V3 V4
2001 1 2 1 1
2002 2 4 3 2
2003 3 6 5 3
2004 4 8 7 4
The R grep
function returns an integer vector that is used to selectively "pull" columns containing individual "V"-column names from the larger dataframe.
If you needed to generate the names automagically:
> unique(sapply(strsplit(names(DF)[-1], ".", fixed=TRUE), "[", 2) )
[1] "V1" "V2" "V3" "V4"
Upvotes: 7
Reputation: 5390
I'm sure it can be done more elegantly, but this is one possibility that seems to work.
# declare the column names
colnames = c("V1", "V2", "V3", "V4")
# calculate the means
means = lapply(colnames, function(name) { apply(DF[,grep(name, names(DF))], 1, mean) })
# build the result
result = do.call(cbind, means)
result = as.data.frame(t(result))
rownames(result) = DF$Date
I should also describe, what I did.
First, I declared the column names to be partially matched.
Then, using the grep
command to partially select the columns in your data frame (that matched the particular substring). The apply
command calculates the means and lapply
does it for all columns partially matched by the substring.
Using do.call
and cbind
(as suggested by DWin), we concatenate individual columns.
Finally, we set the column names from the Date
column of the original data frame.
The problem can be solved more elgantly and efficiently, see solutions by DWin and Maiasaura.
Upvotes: 0
Reputation: 32986
library(plyr)
ddply(DF, .(Date), function(x) {
foo <- melt(x, id.vars = 1)
foo$variable <- substr(foo$variable, 4, 6)
return(dcast(foo, Date ~ variable, mean))
})
Date V1 V2 V3 V4
1 2001 1 2 1 1
2 2002 2 4 3 2
3 2003 3 6 5 3
4 2004 4 8 7 4
Upvotes: 4
Reputation: 115425
You can use grep
with value = T
to get the appropriate names and then create call to eval
within the j
component of a data.table
library(data.table)
# convert to a data.table
DT <- data.table(DF)
# the indices we wish to group
.index <- paste0('V',1:3)
# a list containing the names
name_list <- mapply(grep, pattern = as.list(.index ),
MoreArgs = list(x= names(DT),value=T ), SIMPLIFY=F)
# create the expression
.e <- parse(text=sprintf('list( %s)', paste(mapply(sprintf, .index, lapply(name_list, paste, collapse = ', '),
MoreArgs = list(fmt = '%s = mean(c(%s), na.rm = T)')), collapse = ',')))
DT[, eval(.e),by=Date]
## Date V1 V2 V3
## 1: 2001 1 2 1
## 2: 2002 2 4 3
## 3: 2003 3 6 5
## 4: 2004 4 8 7
# what .e looks like
.e
## expression(list( V1 = mean(c(A1.V1, A2.V1, A3.V1), na.rm = T),V2 = mean(c(A1.V2, A2.V2, A3.V2), na.rm = T),V3 = mean(c(A1.V3, A2.V3, A3.V3), na.rm = T)))
Upvotes: 2