Reputation: 7107
I am trying to apply a function to a very large matrix. Eventually I want to create a 40,000 by 40,000 matrix (where only one side of the diagonal is filled in) or a list of the results.
The matrix looks like:
            obs 1     obs 2     obs 3     obs 4     obs 5     obs 6     obs 7     obs 8     obs 9
words 1 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775 0.5999890 0.2875775
words 2 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051 0.3328235 0.7883051
words 3 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769 0.4886130 0.4089769
words 4 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174 0.9544738 0.8830174
words 5 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673 0.4829024 0.9404673
words 6 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565 0.8903502 0.0455565
I call the function as cosine(mat[, 3], mat[, 4]), which gives me a single number:
[,1]
[1,] 0.7546113
I can do this for all of the columns, but I want to know which columns each result came from, i.e. the calculation above came from columns 3 and 4, which are "obs 3" and "obs 4".
Expected output might be the results in a list or a matrix like:
     [,1] [,2] [,3]
[1,] 1    .    .
[2,] 0.75 1    .
[3,] 0.23 0.87 1
(Where the numbers here are made up)
So the dimensions will be ncol(mat) by ncol(mat) (if I go with the matrix method).
Data/Code:
#generate some data
mat <- matrix(data = runif(200), nrow = 100, ncol = 20, dimnames = list(paste("words", 1:100),
paste("obs", 1:20)))
mat
#calculate the following function
library(lsa)
cosine(mat[, 3], mat[, 4])
cosine(mat[, 4], mat[, 5])
cosine(mat[, 5], mat[, 6])
I thought about doing the following:
- Creating an empty matrix and filling it in a for loop, but it's not working as expected, and creating a 40,000 by 40,000 matrix of 0's runs into memory issues.
co <- matrix(0L, nrow = ncol(mat), ncol = ncol(mat),
             dimnames = list(colnames(mat), colnames(mat)))
co
# fill only the cells below the diagonal
for (i in 2:ncol(mat)) {
  for (j in 1:(i - 1)) {
    co[i, j] <- cosine(mat[, i], mat[, j])
  }
}
co
I also tried putting the results into a list:
List <- list()
for(i in 1:ncol(mat))
{
temp <- List[[i]] <- mat
}
res <- List[1][[1]]
res
Which is also wrong.
So I am trying to create a function which will column by column calculate the function and store the results.
Upvotes: 2
Views: 300
Reputation: 1456
We can iterate over the column indexes using purrr (as a better(?) alternative to for loops). I think the toy dataset was supposed to have 2000, not 200, data points?
library(tidyverse)
mat <-
  matrix(
    data = runif(2000),
    nrow = 100,
    ncol = 20,
    dimnames = list(paste("words", 1:100),
                    paste("obs", 1:20))
  )
# one row per pair of column indexes to compare
cos_summary <- tibble(Row1 = 3:5, Row2 = 4:6)
cos_summary <- cos_summary %>%
  mutate(cos_1_2 = map2_dbl(Row1, Row2, ~lsa::cosine(mat[, .x], mat[, .y])))
cos_summary
# A tibble: 3 x 3
   Row1  Row2 cos_1_2
  <int> <int>   <dbl>
1     3     4   0.710
2     4     5   0.734
3     5     6   0.751
Upvotes: 0
Reputation: 269644
1) Using mat as shown in the question, the first line creates a 20x20 matrix with all 20*20 cosines filled in. The second line zeros out the values on and above the diagonal. Use lower.tri instead if you prefer that the values on and below the diagonal be zeroed out.
comat <- cosine(mat)
comat[upper.tri(comat, diag = TRUE)] <- 0
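To see which pair of columns each value came from, the lower triangle can be unpacked into a long table keyed by column name. This is a minimal sketch, assuming a matrix small enough to hold in memory; cosine_matrix() is a hypothetical base-R stand-in for cosine(mat) so that the block runs without extra packages, and the 2000-value toy data is an assumption.

```r
# Hypothetical helper: cosine similarity between all columns of m (base R only).
cosine_matrix <- function(m) {
  cp <- crossprod(m)          # t(m) %*% m: all pairwise dot products
  norms <- sqrt(diag(cp))     # Euclidean norm of each column
  cp / outer(norms, norms)    # divide each dot product by both norms
}

set.seed(1)
mat <- matrix(runif(2000), nrow = 100, ncol = 20,
              dimnames = list(paste("words", 1:100), paste("obs", 1:20)))

comat <- cosine_matrix(mat)

# Row/column indices of the cells strictly below the diagonal
idx <- which(lower.tri(comat), arr.ind = TRUE)

# One row per pair, labelled with the names of the two columns compared
pairs <- data.frame(
  col_a  = colnames(comat)[idx[, "col"]],
  col_b  = rownames(comat)[idx[, "row"]],
  cosine = comat[idx],
  stringsAsFactors = FALSE
)
head(pairs)
```

A long table like this holds choose(n, 2) rows rather than n * n cells, and it can be filtered or sorted by cosine directly.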
2) Alternatively, to create a named numeric vector of the results:
covec <- c(combn(as.data.frame(mat), 2, function(x) c(cosine(x[, 1], x[, 2]))))
names(covec) <- combn(colnames(mat), 2, paste, collapse = "-")
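3) If the columns are mean-centered, cosine similarity equals the Pearson correlation. The code below assumes that the off-diagonal cosines are the same as the correlations up to a single factor, mult; for uncentered data that does not hold in general, so check the result against cosine(mat) before relying on it.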
mult <- c(cosine(mat[, 1], mat[, 2]) / cor(mat[, 1], mat[, 2]))
co3 <- mult * cor(mat)
co3[upper.tri(co3, diag = TRUE)] <- 0
3a) This opens up the use of any of several correlation functions available in R. For example, using mult as just calculated:
library(HiClimR)
co4 <- mult * fastCor(mat)
co4[upper.tri(co4, diag = TRUE)] <- 0
3b)
library(propagate)
co5 <- mult * bigcor(mat)
co5[upper.tri(co5, diag = TRUE)] <- 0
3c)
co6 <- crossprod(scale(mat)) / (nrow(mat) - 1)
co6[upper.tri(co6, diag = TRUE)] <- 0
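On the memory side, a dense 40,000 by 40,000 matrix of doubles needs 40,000^2 * 8 bytes, roughly 12.8 GB, before any copies are made. One way around that is to compute exact cosines in column blocks and keep only the pairs of interest. The following is a sketch under assumptions, not a tuned implementation: cosine_pairs_blockwise(), block_size, and threshold are hypothetical names and values, and the matrix is assumed to have column names.

```r
# Sketch: exact cosine similarities computed in column blocks (base R only).
# block_size and threshold are illustrative values, not recommendations.
cosine_pairs_blockwise <- function(m, block_size = 1000, threshold = 0.9) {
  # Scale every column to unit length; dot products are then cosines.
  unit <- sweep(m, 2, sqrt(colSums(m^2)), "/")
  n <- ncol(unit)
  starts <- seq(1, n, by = block_size)
  out <- vector("list", length(starts))

  for (k in seq_along(starts)) {
    cols <- starts[k]:min(starts[k] + block_size - 1, n)
    # n x length(cols) block: cosines of every column against this block
    block <- crossprod(unit, unit[, cols, drop = FALSE])
    # Keep cells below the diagonal of the full matrix that reach the threshold
    below <- row(block) > cols[as.vector(col(block))]
    keep <- which(block >= threshold & below, arr.ind = TRUE)
    out[[k]] <- data.frame(
      col_a  = colnames(m)[cols[keep[, "col"]]],
      col_b  = colnames(m)[keep[, "row"]],
      cosine = block[keep],
      stringsAsFactors = FALSE
    )
  }
  do.call(rbind, out)
}

# Usage on toy data (assumed: 2000 random values, named columns)
set.seed(1)
mat <- matrix(runif(2000), nrow = 100, ncol = 20,
              dimnames = list(paste("words", 1:100), paste("obs", 1:20)))
res <- cosine_pairs_blockwise(mat, block_size = 6, threshold = 0)
head(res)
```

Only one n by block_size block is held at a time (about 320 MB for 40,000 columns and a block of 1,000), and each returned row still names the two columns it came from.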
Upvotes: 2
Reputation: 887128
We can do this with a nested sapply
i1 <- seq_len(ncol(mat))
sapply(i1, function(i) sapply(i1, function(j) cosine(mat[, i], mat[, j])))
#            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]     [,11]     [,12]
# [1,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [2,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [3,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [4,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [5,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# [6,] 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000
# [7,] 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016 1.0000000 0.7849016
# ....
Upvotes: 1
Reputation: 388982
One option is to define a function that takes two column indexes and then use outer to apply it to all combinations of columns.
fun <- function(x, y) {
  cosine(mat[, x], mat[, y])
}
outer(seq_len(ncol(mat)), seq_len(ncol(mat)), Vectorize(fun))
# [,1] [,2] [,3] [,4] [,5] .....
#[1,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#[2,] 0.7824 1.0000 0.7824 1.0000 0.7824 .....
#[3,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#[4,] 0.7824 1.0000 0.7824 1.0000 0.7824 .....
#[5,] 1.0000 0.7824 1.0000 0.7824 1.0000 .....
#....
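Because the cosine is symmetric, only the pairs with i < j need to be evaluated, which roughly halves the number of calls compared with evaluating every (i, j) combination. A minimal sketch of that idea, with the result labelled by column name; cos_vec() is a hypothetical base-R stand-in for cosine() on two vectors, and the 2000-value toy data is an assumption.

```r
# Hypothetical stand-in for lsa::cosine() applied to two numeric vectors
cos_vec <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

set.seed(1)
mat <- matrix(runif(2000), nrow = 100, ncol = 20,
              dimnames = list(paste("words", 1:100), paste("obs", 1:20)))

n <- ncol(mat)
co <- matrix(NA_real_, n, n, dimnames = list(colnames(mat), colnames(mat)))

# 2 x choose(n, 2) matrix: each column holds one pair of indexes with i < j
idx <- combn(n, 2)
vals <- apply(idx, 2, function(p) cos_vec(mat[, p[1]], mat[, p[2]]))

# Fill the cells below the diagonal: row = larger index, column = smaller index
co[cbind(idx[2, ], idx[1, ])] <- vals
diag(co) <- 1

# Look a result up by the names of the columns it came from
co["obs 4", "obs 3"]
```

Cells above the diagonal stay NA here, which makes it explicit that they were never computed.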
Upvotes: 2