construct distance matrix in R, but from multiple input matrices

Question

Several R functions construct a distance matrix from an input matrix or data frame (x) and a specified distance measure (e.g. Euclidean), such as the dist function in the stats package (loaded by default). The proxy package has a dist function (yes, the same name) that extends stats::dist: its method argument accepts a function, a registry entry, or a mnemonic string referencing the proximity measure. This is very convenient if users have their own distance measure programmed as a function. For example (from the proxy help document):

library(proxy)
## input matrix
x <- matrix(rnorm(16), ncol = 4)
## custom distance function
f <- function(x, y) sum(x * y)
## call proxy's dist explicitly, since stats::dist has the same name
proxy::dist(x, f)

The resulting distance matrix indicates that (for instance) the distance between row 1 and row 2 of x is 2.32 (in my run, since no seed is set), which can be calculated manually as sum(x[1,]*x[2,]). Note that the function f takes two arguments x and y, which are essentially two rows of the input matrix x passed to proxy::dist. In other words, the distance calculation relies entirely on the input matrix x alone.
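To make the role of f concrete, here is a rough base-R sketch of the pairwise computation described above, done by hand. It is not how proxy implements dist; the helper name pairwise_by_hand and the seed are made up for illustration.

```r
# Hypothetical helper: apply a two-argument function f to every pair of rows of x.
pairwise_by_hand <- function(x, f) {
  n <- nrow(x)
  d <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- f(x[i, ], x[j, ])  # f only ever sees two rows of x
    }
  }
  d
}

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
x <- matrix(rnorm(16), ncol = 4)
f <- function(x, y) sum(x * y)

d <- pairwise_by_hand(x, f)
d[2, 1]  # same value as sum(x[1, ] * x[2, ])
```

The point is that f receives nothing but two rows of x, which is exactly the limitation the rest of the question is about.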

Here is my question: I also want to calculate a distance matrix for the input matrix x (i.e. rows are observations and I want the pairwise distances between rows of x). However, the function I use to calculate the distance does NOT rely solely on the input matrix x but on some matrices derived from x. I store the necessary matrices in a list called prep_matrices, which consists of three matrices A, B and C (I made these up for reproducible results):

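set.seed(111)
A = matrix(rnorm(9), nrow = 3)
set.seed(222)
B = matrix(rnorm(9), nrow = 3)
set.seed(333)
C = matrix(rnorm(9), nrow = 3)
## collect the derived matrices in the list referred to in the text
prep_matrices = list(A = A, B = B, C = C)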

Obviously the input matrix x is 3-by-3 and prep_matrices$A, prep_matrices$B, prep_matrices$C will give the derived matrices from x. Now assume that the distance between two rows of x is calculated as (for instance, row 1 and row 2):

m1 = diag(A[1, ])
m2 = diag(A[2, ])
b1 = B[1, ]
b2 = B[2, ]
c1 = C[1, ]
c2 = C[2, ]
distance = mean(m1 %*% ( (diag(b1)-diag(b2)) %*% (diag(c1)-diag(c2)) %*% m2))
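As a side note on this particular made-up formula: every factor is a diagonal matrix, so the product is diagonal too, and the mean over all nine entries reduces to an elementwise product of rows divided by ncol(A)^2. The following is a small check of that observation under the A, B, C defined above; it says nothing about more complicated distance functions.

```r
# Recreate the example matrices so the check is self-contained.
set.seed(111); A <- matrix(rnorm(9), nrow = 3)
set.seed(222); B <- matrix(rnorm(9), nrow = 3)
set.seed(333); C <- matrix(rnorm(9), nrow = 3)

i <- 1; j <- 2

# Formula exactly as written above, using diagonal matrices.
full <- mean(diag(A[i, ]) %*% ((diag(B[i, ]) - diag(B[j, ])) %*%
                               (diag(C[i, ]) - diag(C[j, ])) %*% diag(A[j, ])))

# Elementwise shortcut: only the diagonal entries are non-zero,
# and mean() divides by the total number of entries, ncol(A)^2.
short <- sum(A[i, ] * (B[i, ] - B[j, ]) * (C[i, ] - C[j, ]) * A[j, ]) / ncol(A)^2

all.equal(full, short)
```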

This example is for illustration only, but I hope it conveys how the distance is calculated. I realize that it might be impossible to pass a list (prep_matrices) to an existing R function and get the distance matrix directly, since extra calculations are involved and, most importantly, the distance is based not on the input matrix but on several derived matrices.

Is there an efficient way to code this in R to get a distance matrix in this case? Or could we possibly modify existing R functions? Thanks a lot!

mrip · Accepted Answer

Depending on how complicated the distance function is, you could just forget about dist and write a function that takes in row numbers i,j and computes the distance of those two rows. So for your example, it would look like this:

ff <- function(i, j) mean(diag(A[i, ]) %*% ((diag(B[i, ]) - diag(B[j, ])) %*% (diag(C[i, ]) - diag(C[j, ])) %*% diag(A[j, ])))

Then you could get the distance matrix by applying this to every pair of row indices in 1:nrow(x), which in this case would be

distMatrix <- outer(1:3, 1:3, Vectorize(ff))

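Vectorize is necessary because outer calls the function once with whole vectors of i and j values and expects a result vector of the same length, whereas ff only handles a single pair of row numbers.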
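Putting the pieces together, here is a minimal self-contained sketch that takes the list of derived matrices directly. The wrapper name dist_from_list is hypothetical, and the formula inside is just the example from the question; swap in your real distance.

```r
# Example matrices from the question, collected in a list.
set.seed(111); A <- matrix(rnorm(9), nrow = 3)
set.seed(222); B <- matrix(rnorm(9), nrow = 3)
set.seed(333); C <- matrix(rnorm(9), nrow = 3)
prep_matrices <- list(A = A, B = B, C = C)

# Hypothetical wrapper: build the full pairwise matrix from the list.
dist_from_list <- function(mats) {
  A <- mats$A; B <- mats$B; C <- mats$C
  # Distance between rows i and j, indexed by row number.
  ff <- function(i, j) {
    mean(diag(A[i, ]) %*% ((diag(B[i, ]) - diag(B[j, ])) %*%
                           (diag(C[i, ]) - diag(C[j, ])) %*% diag(A[j, ])))
  }
  idx <- seq_len(nrow(A))
  outer(idx, idx, Vectorize(ff))  # Vectorize lets outer pass index vectors
}

distMatrix <- dist_from_list(prep_matrices)

# Optional: convert to a "dist" object (keeps only the lower triangle).
distObj <- as.dist(distMatrix)
```

The conversion with as.dist is reasonable here because this example formula is symmetric in i and j; if your real distance is not symmetric, keep the full matrix instead. Note also that as.dist does not check that the values behave like a true metric.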
