Vinterwoo
Vinterwoo

Reputation: 3941

Calculate row means based on (partial) matching column names

I am starting with 3 large data tables (named A1,A2,A3). Each table has 4 data columns (V1-V4), 1 "Date" column that is constant across all three tables, and thousands of rows.

Here is some dummy data that approximates my tables.

A1.V1<-c(1,2,3,4)
A1.V2<-c(2,4,6,8)
A1.V3<-c(1,3,5,7)
A1.V4<-c(1,2,3,4)


A2.V1<-c(1,2,3,4)
A2.V2<-c(2,4,6,8)
A2.V3<-c(1,3,5,7)
A2.V4<-c(1,2,3,4)


A3.V1<-c(1,2,3,4)
A3.V2<-c(2,4,6,8)
A3.V3<-c(1,3,5,7)
A3.V4<-c(1,2,3,4)

Date<-c(2001,2002,2003,2004)

DF<-data.frame(Date, A1.V1,A1.V2,A1.V3,A1.V4,A2.V1,A2.V2,A2.V3,A2.V4,A3.V1,A3.V2,A3.V3,A3.V4)

So this is what my data frame ends up looking like:

  Date A1.V1 A1.V2 A1.V3 A1.V4 A2.V1 A2.V2 A2.V3 A2.V4 A3.V1 A3.V2 A3.V3 A3.V4
1 2001     1     2     1     1     1     2     1     1     1     2     1     1
2 2002     2     4     3     2     2     4     3     2     2     4     3     2
3 2003     3     6     5     3     3     6     5     3     3     6     5     3
4 2004     4     8     7     4     4     8     7     4     4     8     7     4

My goal is to calculate the row mean for each of the matching columns from each data table. So in this instance, I would want row means for all columns ending in V1, all columns ending in V2, all columns ending in V3 and all columns ending in V4.

The end result would look like this

      V1  V2  V3  V4
2001   1   2   1   1
2002   2   4   3   2
2003   3   6   5   3
2004   4   8   7   4

So my question is, how to I go about calculating row means based on a partial match in the column name?

Thanks

Upvotes: 3

Views: 5586

Answers (4)

IRTFM
IRTFM

Reputation: 263411

colnames = c("V1", "V2", "V3", "V4")
res <- sapply(colnames, function(x) rowMeans(DF [, grep(x, names(DF))] )  )
rownames(res) <- DF$Date
res
     V1 V2 V3 V4
2001  1  2  1  1
2002  2  4  3  2
2003  3  6  5  3
2004  4  8  7  4

The R grep function returns an integer vector that is used to selectively "pull" columns containing individual "V"-column names from the larger dataframe.

If you needed to generate the names automagically:

> unique(sapply(strsplit(names(DF)[-1], ".", fixed=TRUE), "[", 2) )
[1] "V1" "V2" "V3" "V4"

Upvotes: 7

Timo
Timo

Reputation: 5390

I'm sure it can be done more elegantly, but this is one possibility that seems to work.

# declare the column names
colnames = c("V1", "V2", "V3", "V4")

# calculate the means
means = lapply(colnames, function(name) { apply(DF[,grep(name, names(DF))], 1, mean) })

# build the result
result = do.call(cbind, means)
result = as.data.frame(t(result))
rownames(result) = DF$Date

I should also describe, what I did.

First, I declared the column names to be partially matched.

Then, using the grep command to partially select the columns in your data frame (that matched the particular substring). The apply command calculates the means and lapply does it for all columns partially matched by the substring.

Using do.call and cbind (as suggested by DWin), we concatenate individual columns. Finally, we set the column names from the Date column of the original data frame.

The problem can be solved more elgantly and efficiently, see solutions by DWin and Maiasaura.

Upvotes: 0

Maiasaura
Maiasaura

Reputation: 32986

library(plyr)
ddply(DF, .(Date), function(x) {
    foo <- melt(x, id.vars = 1)
    foo$variable <- substr(foo$variable, 4, 6)
    return(dcast(foo, Date ~ variable, mean))
    })
Date V1 V2 V3 V4
1 2001  1  2  1  1
2 2002  2  4  3  2
3 2003  3  6  5  3
4 2004  4  8  7  4

Upvotes: 4

mnel
mnel

Reputation: 115425

You can use grep with value = T to get the appropriate names and then create call to eval within the j component of a data.table

library(data.table)
# convert to a data.table
DT <- data.table(DF)
 # the indices we wish to group
.index <- paste0('V',1:3)
# a list containing the names
name_list <- mapply(grep, pattern = as.list(.index ), 
                  MoreArgs = list(x= names(DT),value=T ), SIMPLIFY=F)
 # create the expression
.e <- parse(text=sprintf('list( %s)', paste(mapply(sprintf, .index, lapply(name_list, paste, collapse = ', '), 
            MoreArgs = list(fmt = '%s = mean(c(%s), na.rm = T)')), collapse = ',')))

DT[, eval(.e),by=Date]

##    Date V1 V2 V3
## 1: 2001  1  2  1
## 2: 2002  2  4  3
## 3: 2003  3  6  5
## 4: 2004  4  8  7

# what .e looks like
.e 
## expression(list( V1 = mean(c(A1.V1, A2.V1, A3.V1), na.rm = T),V2 = mean(c(A1.V2, A2.V2, A3.V2), na.rm = T),V3 = mean(c(A1.V3, A2.V3, A3.V3), na.rm = T)))

Upvotes: 2

Related Questions