Remi.b

Reputation: 18219

R: efficient matrix operation

I have a large matrix. The goal is to group together all rows that share the same values in columns 1, 2 and 3, and to compute the percentage of zeros in the 4th column for each such group. All these percentages go into a vector called "data". I also need three vectors (one per column, excluding the last) recording the common value of each grouping column; call these vectors "factor1", "factor2" and "factor3" for columns 1, 2 and 3 respectively. As my matrices are large and numerous, I need something fast and efficient.

For example I have this matrix:

     [,1] [,2] [,3] [,4]
[1,]    1    1    1  0.0
[2,]    1    2    1  0.0
[3,]    3    2    1  0.6
[4,]    1    1    1  0.2
[5,]    1    2    1  0.0
[6,]    1    1    2  0.1
[7,]    3    2    1  0.9

Here we would group rows 1 and 4 together (they share the same values in columns 1, 2 and 3) and calculate the percentage of zeros in column 4 (%zero = 0.5).

Then we group rows 2 and 5 and again calculate %zero (= 1).

Then we group rows 3 and 7 and calculate %zero (= 0).

Finally, row 6 stands alone, so its %zero is 0.

Here are the vectors I want to get:

> data = c(0.5,1,0,0)

> factor1 = c(1,1,3,1)

> factor2 = c(1,2,2,1)

> factor3 = c(1,1,1,2)

The order of these values is not important, but it must be consistent across the vectors: if the value 0.5 in the vector "data" sits at position 2, then position 2 of all three factor vectors should hold the value 1.

The goal is then to run the following aov:

> aov(data ~ factor1 * factor2 * factor3)

Thanks a lot for your help.

Upvotes: 1

Views: 270

Answers (3)

John

Reputation: 23758

If your matrices are very large, do not forget to include the cost of converting to a data.table when benchmarking that approach. The following should be pretty fast.

colnames(m) <- c(paste0('factor', 1:3), 'data')
# proportion of zeros in the 4th column within each group
aggregate(data ~ ., data = m, function(x) mean(x == 0))
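
For reference, building the question's example matrix and running the two lines above gives the following (a minimal sketch; the row order of aggregate's result may differ):

m <- matrix(c(1, 1, 1, 0,
              1, 2, 1, 0,
              3, 2, 1, 0.6,
              1, 1, 1, 0.2,
              1, 2, 1, 0,
              1, 1, 2, 0.1,
              3, 2, 1, 0.9),
            ncol = 4, byrow = TRUE)
#   factor1 factor2 factor3 data
# 1       1       1       1  0.5
# 2       1       2       1  1.0
# 3       3       2       1  0.0
# 4       1       1       2  0.0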

That said, I tested it, and it turns out that plyr is actually quite competitive with data.table here, with aggregate quite far behind (about 3x slower). Recent versions of plyr (1.8) have become much faster than they used to be.

In some testing I found that aggregate becomes much faster (about 2x) if you convert the matrix to a data.frame first, even when you include the conversion time; see the sketch below.
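
A sketch of that variant, reusing m from above:

m_df <- as.data.frame(m)   # convert once, up front
aggregate(data ~ ., data = m_df, function(x) mean(x == 0))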

None of the posted answers is really slow at all. If you've got lots of these matrices, I imagine they live in files, and if your code is slow, I'd guess the bottleneck is there. There are ways to speed up reading matrices from files, e.g. use scan instead of read.table.
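
As a hedged sketch (the file name and layout are assumptions: a headerless, whitespace-delimited file of four numeric columns):

# scan() returns a flat numeric vector, skipping read.table's parsing overhead
v <- scan("matrix01.txt", what = numeric(), quiet = TRUE)
m <- matrix(v, ncol = 4, byrow = TRUE)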

(as an aside, you probably should not be doing ANOVA on this data)

Upvotes: 2

juba

Reputation: 49033

Here is a solution with the plyr package. If m is your matrix:

library(plyr)

m <- data.frame(m)
colnames(m) <- c("V1","V2","V3","data")
m2 <- ddply(m, .(V1,V2,V3), summarise, data=sum(data==0)/length(data))

Which gives:

  V1 V2 V3 data
1  1  1  1  0.5
2  1  1  2  0.0
3  1  2  1  1.0
4  3  2  1  0.0

You could then do:

aov(data ~ V1 * V2 * V3, data = m2)
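
And if you want the four standalone vectors exactly as named in the question, you could extract them first (a sketch):

data    <- m2$data
factor1 <- m2$V1
factor2 <- m2$V2
factor3 <- m2$V3
aov(data ~ factor1 * factor2 * factor3)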

plyr is not always the fastest solution, though.

Upvotes: 0

Roland

Reputation: 132706

Recreate your matrix:

df <- read.table(text="[,1] [,2] [,3] [,4]
[1,]  1    1   1   0
[2,]  1    2   1   0
[3,]  3    2   1   0.6
[4,]  1    1   1   0.2
[5,]  1    2   1   0
[6,]  1    1   2   0.1
[7,]  3    2   1   0.9",header=TRUE)
m <- as.matrix(df)
colnames(m) <- NULL

Use package data.table for the split-apply-combine step; it is recommended for huge datasets because of its efficiency.

library(data.table)
DT <- as.data.table(m)

# proportion of zeros in V4 within each (V1, V2, V3) group
DT.agg <- DT[, sum(V4 == 0)/length(V4), by = list(V1, V2, V3)]
setnames(DT.agg, c("factor1", "factor2", "factor3", "data"))
print(DT.agg)
#   factor1 factor2 factor3 data
#1:       1       1       1  0.5
#2:       1       2       1  1.0
#3:       3       2       1  0.0
#4:       1       1       2  0.0

aov(data ~ factor1 * factor2 * factor3, data = DT.agg)
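
Equivalently, on reasonably recent data.table versions the aggregation and renaming can be combined in one step (a sketch; mean(V4 == 0) is the same as sum(V4 == 0)/length(V4)):

DT.agg <- DT[, .(data = mean(V4 == 0)),
             by = .(factor1 = V1, factor2 = V2, factor3 = V3)]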

Upvotes: 2
