user113156

Reputation: 7107

perform calculation on all combinations of columns of a matrix

I am trying to apply a function to every pair of columns of a very large matrix. Eventually I want to create a 40,000 by 40,000 matrix (where only one side of the diagonal is filled in) or a list of the results.

The matrix looks like:

            obs 1     obs 2     obs 3     obs 4     obs 5     obs 6     obs 7     obs 8     obs 9
words 1 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775
words 2 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051
words 3 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769
words 4 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174
words 5 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673
words 6 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565

I call the function as cosine(mat[, 3], mat[, 4]), which gives me a single number:

          [,1]
[1,] 0.7546113

I can do this for all of the columns but I want to be able to know which columns they came from, i.e. the calculation above came from columns 3 and 4 which is "obs 3" and "obs 4".

Expected output might be the results in a list or a matrix like:

          [,1]   [,2]   [,3]
[1,]        1      .      .
[2,]      0.75     1      .
[3,]      0.23    0.87    1

(Where the numbers here are made up)

So the dimensions will be ncol(mat) by ncol(mat) (if I go with the matrix method).

Data/Code:

#generate some data

mat <- matrix(data = runif(200), nrow = 100, ncol = 20, dimnames = list(paste("words", 1:100),
                                                                        paste("obs", 1:20)))


mat


#calculate the following function
library(lsa)
cosine(mat[, 3], mat[, 4])
cosine(mat[, 4], mat[, 5])
cosine(mat[, 5], mat[, 6])

Additional

I thought about creating an empty matrix and filling it in a for loop, but it's not working as expected, and creating a 40,000 by 40,000 matrix of 0's brings up memory issues.

co <- matrix(0L, nrow = ncol(mat), ncol = ncol(mat), dimnames = list(colnames(mat), colnames(mat)))
co

for (i in 2:ncol(mat)) {
  for (j in 1:(i - 1)) {
    co[i, j] = cosine(mat[, i], mat[, j])
  }
}

co

I also tried putting the results into a list:

List <- list()
for(i in 1:ncol(mat))
{
  temp <- List[[i]] <- mat
}

res <- List[1][[1]]
res

Which is also wrong.

So I am trying to write a function that computes the cosine for each pair of columns and stores the results together with the column names they came from.

Upvotes: 2

Views: 300

Answers (4)

Marian Minar

Reputation: 1456

We can iterate over column indexes with purrr (as a better(?) alternative to for loops). I assume the toy dataset was meant to have 2000 values rather than 200, since runif(200) is recycled to fill the 100 x 20 matrix and the columns then repeat.

library(tidyverse)

mat <-
  matrix(
    data = runif(2000),
    nrow = 100,
    ncol = 20,
    dimnames = list(paste("words", 1:100),
                    paste("obs", 1:20))
  )

cos_summary <- tibble(Row1 = 3:5, Row2 = 4:6)

cos_summary <- cos_summary %>%
  mutate(cos_1_2 = map2_dbl(Row1, Row2, ~lsa::cosine(mat[,.x], mat[,.y])))

cos_summary

# A tibble: 3 x 3
   Row1  Row2 cos_1_2
  <int> <int>   <dbl>
1     3     4   0.710
2     4     5   0.734
3     5     6   0.751
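
To cover every pair of columns rather than three hand-picked ones, the index pairs could be generated with combn. The following is a minimal sketch in base R, not part of the original answer: cosine_pair is a hypothetical stand-in for lsa::cosine on two vectors (so the example runs without lsa), and all_pairs is an illustrative name. map2_dbl could replace mapply if you prefer to stay in purrr.

```r
# Hypothetical stand-in for lsa::cosine() on two vectors (assumption for this sketch)
cosine_pair <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

set.seed(1)
mat <- matrix(runif(2000), nrow = 100, ncol = 20,
              dimnames = list(paste("words", 1:100), paste("obs", 1:20)))

# Every unordered pair of column indices: a 2 x choose(20, 2) matrix
pairs <- combn(ncol(mat), 2)

# One row per pair, labelled with the column names the value came from
all_pairs <- data.frame(
  col1 = colnames(mat)[pairs[1, ]],
  col2 = colnames(mat)[pairs[2, ]],
  cosine = mapply(function(i, j) cosine_pair(mat[, i], mat[, j]),
                  pairs[1, ], pairs[2, ])
)

head(all_pairs, 3)
```

This long format has choose(n, 2) rows, which is roughly 800 million for 40,000 columns, so it is only practical for moderate n.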

Upvotes: 0

G. Grothendieck

Reputation: 269644

1) Using mat shown in the question (with lsa loaded), the first line creates a 20x20 matrix with all 20*20 cosines filled in, keeping the column names as row and column names. The second line zeros out the values on and above the diagonal. Use lower.tri instead if you prefer that the values on and below the diagonal be zeroed out.

comat <- cosine(mat)
comat[upper.tri(comat, diag = TRUE)] <- 0
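
To see which pair of columns each remaining value belongs to, the lower triangle can be unrolled into a long table. This is a sketch under assumptions: comat is rebuilt here with base R as a stand-in for cosine(mat) so the example runs without lsa, and res is an illustrative name.

```r
set.seed(1)
mat <- matrix(runif(500), nrow = 100, ncol = 5,
              dimnames = list(paste("words", 1:100), paste("obs", 1:5)))

# Stand-in for comat <- cosine(mat): all pairwise column cosines (assumption)
norms <- sqrt(colSums(mat^2))
comat <- crossprod(mat) / outer(norms, norms)

# Row/column positions of the cells strictly below the diagonal
idx <- which(lower.tri(comat), arr.ind = TRUE)

# Look up the names for each position and pull the matching values
res <- data.frame(
  row = rownames(comat)[idx[, 1]],
  col = colnames(comat)[idx[, 2]],
  cosine = comat[idx]
)

head(res, 3)
```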

2) Alternatively, to create a named numeric vector of the results:

covec <- c(combn(as.data.frame(mat), 2, function(x) c(cosine(x[, 1], x[, 2]))))
names(covec) <- combn(colnames(mat), 2, paste, collapse = "-")

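3) Correlation is the cosine of the mean-centered columns, so fast correlation routines can stand in for cosine. The code below assumes the off-diagonal cosines equal the correlations up to a single factor, mult. That assumption is not exact for uncentered data such as mat, so compare the result against cosine(mat) on a small sample before relying on it.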

mult <- c(cosine(mat[, 1], mat[, 2]) / cor(mat[, 1], mat[, 2]))
co3 <- mult * cor(mat)
co3[upper.tri(co3, diag = TRUE)] <- 0
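
How cosine and correlation relate can be checked on two vectors. This is a small sketch, with cosine_pair as a hypothetical stand-in for lsa::cosine: the two measures coincide once the vectors are mean-centered, and generally differ on the raw values.

```r
# Hypothetical stand-in for lsa::cosine() on two vectors (assumption for this sketch)
cosine_pair <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

set.seed(1)
x <- runif(100)
y <- runif(100)

# Pearson correlation equals the cosine of the mean-centered vectors
all.equal(cor(x, y), cosine_pair(x - mean(x), y - mean(y)))  # TRUE

# On the raw (uncentered) vectors the two measures are different numbers
c(cosine = cosine_pair(x, y), correlation = cor(x, y))
```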

3a) This makes it possible to use any of the several correlation functions available in R. For example, using mult just calculated:

library(HiClimR)
co4 <- mult * fastCor(mat)
co4[upper.tri(co4, diag = TRUE)] <- 0

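3b) Similarly with bigcor from the propagate package, which is intended for very large correlation matrices: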

library(propagate)
co5 <- mult * bigcor(mat)
co5[upper.tri(co5, diag = TRUE)] <- 0

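3c) Base R only: the cross-product of the standardized columns reproduces cor(mat). Note that this is the correlation matrix (the cosine of the mean-centered columns), not the cosine of the raw columns: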

co6 <- crossprod(scale(mat)) / (nrow(mat) - 1)
co6[upper.tri(co6, diag = TRUE)] <- 0
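
If the cosine of the raw, uncentered columns is what is needed, the same crossprod idea works after scaling each column to unit length instead of standardizing it. This is a sketch, not part of the original answer; co7 and unit are illustrative names continuing the numbering above.

```r
set.seed(1)
mat <- matrix(runif(2000), nrow = 100, ncol = 20,
              dimnames = list(paste("words", 1:100), paste("obs", 1:20)))

# Divide each column by its Euclidean norm (no centering).
# An all-zero column would give NaN here, as its norm is 0.
unit <- sweep(mat, 2, sqrt(colSums(mat^2)), "/")

# Cross-products of unit-length columns are the pairwise cosines
co7 <- crossprod(unit)
co7[upper.tri(co7, diag = TRUE)] <- 0

co7[1:3, 1:3]
```

This needs a single matrix product rather than one function call per pair, though the full ncol(mat) by ncol(mat) result still has to fit in memory.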

Upvotes: 2

akrun

Reputation: 887128

We can do this with a nested sapply:

i1 <- seq_len(ncol(mat))
sapply(i1, function(i) sapply(i1, function(j) cosine(mat[, i], mat[, j])))
#           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]     [,11]     [,12]
# [1,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [2,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [3,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [4,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [5,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [6,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [7,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# ....
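
The result above is indexed by position only. To keep track of which columns each value came from, the column names can be attached as dimnames. This is a sketch, with cosine_pair as a hypothetical stand-in for lsa::cosine so it runs without lsa, and res as an illustrative name.

```r
# Hypothetical stand-in for lsa::cosine() on two vectors (assumption for this sketch)
cosine_pair <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

set.seed(1)
mat <- matrix(runif(500), nrow = 100, ncol = 5,
              dimnames = list(paste("words", 1:100), paste("obs", 1:5)))

i1 <- seq_len(ncol(mat))
res <- sapply(i1, function(i) sapply(i1, function(j) cosine_pair(mat[, i], mat[, j])))

# Label rows and columns so values can be looked up by name
dimnames(res) <- list(colnames(mat), colnames(mat))

res["obs 3", "obs 4"]
```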

Upvotes: 1

Ronak Shah

Reputation: 388982

One option is to define a function that takes two column indexes and then use outer to apply it to all combinations of columns.

fun <- function(x, y) {
   cosine(mat[, x], mat[, y])
}

outer(seq_len(ncol(mat)), seq_len(ncol(mat)), Vectorize(fun))

#       [,1]   [,2]   [,3]   [,4]   [,5]  ..... 
#[1,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#[2,] 0.7824 1.0000 0.7824 1.0000 0.7824 .....
#[3,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#[4,] 0.7824 1.0000 0.7824 1.0000 0.7824 .....
#[5,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#....

Upvotes: 2
