Reputation: 262
I was wondering if there is a way to do the below matrix population using the apply function? I have read that apply is more efficient than for loops. For some reason I struggle with the apply function family.
I am looking to populate a matrix. I am using data in a data frame as conditions of the data population.
Below is a sample of the data logic; it is simplified but it covers the essentials of the population being rule based.
id_1 <- c(1, 1, 1, 1, 1, 1)
id_2 <- c(1, 1, 2, 2, 3, 3)
id_3 <- c(1, 2, 2, 3, 3, 4)
amt <- c(10, 15, 20, 25, 30, 35)
sample_data <- data.frame(id_1, id_2, id_3, amt)
n <- length(sample_data)
cor <- matrix(ncol = n, nrow = n)
i <- 1
j <- 1
for (i in 1:n) {
for (j in 1:n) {
if (i == j) {
cor[i,j] = 1
} else if (sample_data[2][i,] == sample_data[2][j,] & sample_data[3][i,] != sample_data[3][j,]) {
cor[i,j] = 0
} else if (sample_data[2][i,] != sample_data[2][j,] & sample_data[3][i,] == sample_data[3][j,]) {
cor[i,j] = 0.5
} else {
cor[i,j] = 0.25
}
}
}
cor
[,1] [,2] [,3] [,4]
[1,] 1.00 0.00 0.25 0.25
[2,] 0.00 1.00 0.50 0.25
[3,] 0.25 0.50 1.00 0.00
[4,] 0.25 0.25 0.00 1.00
Upvotes: 0
Views: 119
Reputation: 44340
apply
is not more efficient than a for loop, so that's not a good approach if you are looking for efficiency. Instead, you should use vectorized operations. Let's take apart your for loop:
First, elements take value 1 if they are on the diagonal, which can be achieved with the diag
function:
diag(n)
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 0
# [2,] 0 1 0 0
# [3,] 0 0 1 0
# [4,] 0 0 0 1
Off-diagonal entries (i, j) take value 0.5 if the entry i and j in the second column of sample_data
don't match and if entries i and j in the third column of sample_data
match. This can be achieved with the vectorized outer
function:
topn.2 <- head(sample_data[,2], n)
topn.3 <- head(sample_data[,3], n)
0.5 * (outer(topn.2, topn.2, "!=") & outer(topn.3, topn.3, "=="))
# [,1] [,2] [,3] [,4]
# [1,] 0 0.0 0.0 0
# [2,] 0 0.0 0.5 0
# [3,] 0 0.5 0.0 0
# [4,] 0 0.0 0.0 0
Off-diagonal entries (i, j) take value 0.25 if entry i and j match in either both columns 2 and 3 or in neither. Again, this can be achieved with outer
:
0.25 * (outer(1:n, 1:n, "!=") & (outer(topn.2, topn.2, "==") + outer(topn.3, topn.3, "==")) != 1)
# [,1] [,2] [,3] [,4]
# [1,] 0.00 0.00 0.25 0.25
# [2,] 0.00 0.00 0.00 0.25
# [3,] 0.25 0.00 0.00 0.00
# [4,] 0.25 0.25 0.00 0.00
Adding everything together yields a fully vectorized replacement to the for loop:
diag(n) +
0.5 * (outer(topn.2, topn.2, "!=") & outer(topn.3, topn.3, "==")) +
0.25 * (outer(1:n, 1:n, "!=") & (outer(topn.2, topn.2, "==") + outer(topn.3, topn.3, "==")) != 1)
# [,1] [,2] [,3] [,4]
# [1,] 1.00 0.00 0.25 0.25
# [2,] 0.00 1.00 0.50 0.25
# [3,] 0.25 0.50 1.00 0.00
# [4,] 0.25 0.25 0.00 1.00
Upvotes: 3