Achal Neupane

Reputation: 5719

How to create a combination matrix of items in column added together

I have a matrix called mymat. I want to build another matrix that contains every pairwise combination of the samples in mymat, with the two corresponding values added together, so that I get something like the result shown below.

mymat<- structure(c("AOGC-03-0122", "AOGC-05-0009", "AOGC-08-0006", "AOGC-08-0032", 
"AOGC-08-0054", "0.000971685122254438", "0.00114138129544444", 
"0.000779586347096811", "0.00132807674454652", "0.000867219894408284"
), .Dim = c(5L, 2L), .Dimnames = list(NULL, c("samples", "value"
)))

result

combination                      total.value
AOGC-03-0122+AOGC-03-0122         0.00194337
AOGC-03-0122+AOGC-05-0009         0.002113066
.
.
.
AOGC-08-0054+AOGC-08-0054         0.00173444

Upvotes: 1

Views: 49

Answers (2)

bgoldst

Reputation: 35314

  • A matrix is a homogeneous data object. It is basically an atomic vector with a dimension attribute (ignoring the case of a matrix of lists), so you cannot mix strings and numbers in a single matrix; that is why the numbers in mymat are stored as strings. When you want to store a table of data with heterogeneous column types, you should use a data.frame. The appropriate types of the samples and value columns are clearly string and number, respectively. Hence your input matrix should really be a data.frame, and your output should be a data.frame as well, since it merely pairs up the input records.

  • You shouldn't need to call merge() here, and certainly not twice; vectorized indexing can do the job. And using merge() will cause the permutation order to depend on the lexicographic order of the samples values, rather than the order in which they occur in the input, which is probably undesirable.
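To illustrate the first point, here is a minimal sketch on a toy matrix. The names m and df and the values in them are made up for illustration and are not the question's data.

```r
# Toy matrix mixing labels and numbers (hypothetical data)
m <- cbind(samples = c("a", "b"), value = c(0.1, 0.2))
typeof(m)                    # "character": the numbers were coerced to strings
is.character(m[, "value"])   # TRUE

# A data.frame keeps one type per column, so value can be numeric
df <- data.frame(samples = m[, "samples"],
                 value = as.numeric(m[, "value"]),
                 stringsAsFactors = FALSE)
sapply(df, class)            # samples is character, value is numeric
```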


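# The matrix is character, so convert the value column to numeric once.
values <- as.double(mymat[,'value']);
# expand.grid() yields every ordered pair of row indices (Var1 varies fastest).
# Indexing with Var2 first groups the output by the left-hand sample.
with(expand.grid(rep(list(seq_len(nrow(mymat))),2L)),
    data.frame(
        combination=paste(mymat[Var2,'samples'],mymat[Var1,'samples'],sep='+'),
        total.value=values[Var2]+values[Var1]
    )
);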
##                  combination total.value
## 1  AOGC-03-0122+AOGC-03-0122 0.001943370
## 2  AOGC-03-0122+AOGC-05-0009 0.002113066
## 3  AOGC-03-0122+AOGC-08-0006 0.001751271
## 4  AOGC-03-0122+AOGC-08-0032 0.002299762
## 5  AOGC-03-0122+AOGC-08-0054 0.001838905
## 6  AOGC-05-0009+AOGC-03-0122 0.002113066
## 7  AOGC-05-0009+AOGC-05-0009 0.002282763
## 8  AOGC-05-0009+AOGC-08-0006 0.001920968
## 9  AOGC-05-0009+AOGC-08-0032 0.002469458
## 10 AOGC-05-0009+AOGC-08-0054 0.002008601
## 11 AOGC-08-0006+AOGC-03-0122 0.001751271
## 12 AOGC-08-0006+AOGC-05-0009 0.001920968
## 13 AOGC-08-0006+AOGC-08-0006 0.001559173
## 14 AOGC-08-0006+AOGC-08-0032 0.002107663
## 15 AOGC-08-0006+AOGC-08-0054 0.001646806
## 16 AOGC-08-0032+AOGC-03-0122 0.002299762
## 17 AOGC-08-0032+AOGC-05-0009 0.002469458
## 18 AOGC-08-0032+AOGC-08-0006 0.002107663
## 19 AOGC-08-0032+AOGC-08-0032 0.002656153
## 20 AOGC-08-0032+AOGC-08-0054 0.002195297
## 21 AOGC-08-0054+AOGC-03-0122 0.001838905
## 22 AOGC-08-0054+AOGC-05-0009 0.002008601
## 23 AOGC-08-0054+AOGC-08-0006 0.001646806
## 24 AOGC-08-0054+AOGC-08-0032 0.002195297
## 25 AOGC-08-0054+AOGC-08-0054 0.001734440
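
The ordering point about merge() can be checked on a small table whose keys are deliberately not sorted. This is only a sketch: the data frame d and the names idx, by_index, pairs and by_merge are hypothetical and do not come from the question.

```r
# Hypothetical lookup table with keys in non-sorted order
d <- data.frame(samples = c("s3", "s1", "s2"),
                value = c(30, 10, 20),
                stringsAsFactors = FALSE)

# Every ordered pair of row indices; Var1 varies fastest
idx <- expand.grid(Var1 = seq_len(nrow(d)), Var2 = seq_len(nrow(d)))

# Vectorized indexing: rows follow the input order (s3, s1, s2)
by_index <- data.frame(
  combination = paste(d$samples[idx$Var2], d$samples[idx$Var1], sep = "+"),
  total.value = d$value[idx$Var2] + d$value[idx$Var1],
  stringsAsFactors = FALSE
)

# merge() sorts the result on the key by default (sort = TRUE)
pairs <- expand.grid(Var1 = d$samples, Var2 = d$samples,
                     stringsAsFactors = FALSE)
by_merge <- merge(pairs, d, by.x = "Var2", by.y = "samples")

head(by_index$combination, 3)  # "s3+s3" "s3+s1" "s3+s2"
unique(by_merge$Var2)          # "s1" "s2" "s3"
```

With the question's data the two orders happen to coincide because the sample names are already in lexicographic order, so the difference only shows up on unsorted input.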

Performance

bgoldst <- function(mymat) {
    values <- as.double(mymat[,'value']);
    with(expand.grid(rep(list(seq_len(nrow(mymat))),2L)),
        data.frame(
            combination=paste(mymat[Var2,'samples'],mymat[Var1,'samples'],sep='+'),
            total.value=values[Var2]+values[Var1]
        )
    );
};
akrun <- function(mymat) {
    d1 <- expand.grid(rep(list(mymat[, "samples"]),2));
    d2 <- data.frame(samples=mymat[,1], value = as.numeric(mymat[,2]), stringsAsFactors=FALSE);
    d3 <- merge(merge(d1, d2, by.x="Var1", by.y="samples", all.x=TRUE), d2, by.x="Var2", by.y= "samples");
    res <- data.frame(combination = do.call(paste, c(d3[1:2], sep="+")), total.value = d3[,3]+d3[,4]);
};
identical(bgoldst(mymat),akrun(mymat));
## [1] TRUE

library(microbenchmark);
microbenchmark(bgoldst(mymat),akrun(mymat));
## Unit: microseconds
##            expr      min       lq      mean    median       uq      max neval
##  bgoldst(mymat)  390.875  412.685  444.4554  433.8535  457.589  662.434   100
##    akrun(mymat) 1603.697 1658.009 1789.0585 1692.0075 1824.793 3227.921   100

N <- 1e3; mymat <- matrix(c(sprintf('sample_%d',seq_len(N)),runif(N)),ncol=2L,dimnames=list(NULL,c('samples','value')));
x <- bgoldst(mymat); y <- akrun(mymat); identical(structure(transform(x[order(x$combination),],combination=as.character(combination)),row.names=seq_len(nrow(x))),structure(transform(y[order(y$combination),],combination=as.character(combination)),row.names=seq_len(nrow(y)))); ## annoyingly involved line of code to obviate row order, factor levels order, and row names differences
## [1] TRUE
microbenchmark(bgoldst(mymat),akrun(mymat),times=3L);
## Unit: seconds
##            expr       min        lq      mean    median        uq       max neval
##  bgoldst(mymat)  8.103589  8.328722  8.418285  8.553854  8.575633  8.597411     3
##    akrun(mymat) 30.777301 31.152458 31.348615 31.527615 31.634272 31.740929     3

Upvotes: 2

akrun

Reputation: 887213

We can use expand.grid() together with merge():

d1 <- expand.grid(rep(list(mymat[, "samples"]),2))
d2 <-  data.frame(samples=mymat[,1], value = as.numeric(mymat[,2]), 
                 stringsAsFactors=FALSE)
d3 <- merge(merge(d1, d2, by.x="Var1", by.y="samples", all.x=TRUE),
                     d2, by.x="Var2", by.y= "samples")
res <- data.frame(combination = do.call(paste, c(d3[1:2], sep="+")), 
                     total.value = d3[,3]+d3[,4])
head(res,3)
#                combination total.value
#1 AOGC-03-0122+AOGC-03-0122 0.001943370
#2 AOGC-03-0122+AOGC-05-0009 0.002113066
#3 AOGC-03-0122+AOGC-08-0006 0.001751271

Upvotes: 1
