h7681

Reputation: 355

Distance matrix from two separate data frames

I'd like to create a matrix containing the Euclidean distances between the rows of one data frame and the rows of another. For example, say I have the following data frames:

a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df1 <- data.frame(a,b,c)

a2 <- c(2,7,1,2,3)
b2 <- c(7,6,5,4,3)
c2 <- c(1,2,3,4,5)
df2 <- data.frame(a2,b2,c2)

I would like to create a matrix with the distances of each row in df1 versus the rows of df2.

So matrix[2,1] should be the Euclidean distance between df1[2,] and df2[1,], matrix[3,2] the distance between df1[3,] and df2[2,], etc.

Does anyone know how this could be achieved?

Upvotes: 5

Views: 3735

Answers (2)

Diego

Reputation: 156

Perhaps you could use the fields package: the function rdist might do what you want:

rdist : Euclidean distance matrix
Description: Given two sets of locations computes the Euclidean distance matrix among all pairings.

> rdist(df1, df2)
     [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 4.582576 6.782330 2.000000 1.732051 2.828427
[2,] 4.242641 5.744563 1.732051 0.000000 1.732051
[3,] 4.123106 5.099020 3.464102 3.316625 4.000000
[4,] 5.477226 5.000000 4.358899 3.464102 3.316625
[5,] 7.000000 5.477226 5.656854 4.358899 3.464102

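The pdist package works similarly, although it returns the distances as a flat vector (stored row by row in the "dist" slot) rather than as a matrix: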

pdist : Distances between Observations for a Partitioned Matrix
Description: Computes the euclidean distance between rows of a matrix X and rows of another matrix Y.

> pdist(df1, df2)
An object of class "pdist"
Slot "dist":
[1] 4.582576 6.782330 2.000000 1.732051 2.828427 4.242640 5.744563 1.732051
[9] 0.000000 1.732051 4.123106 5.099020 3.464102 3.316625 4.000000 5.477226
[17] 5.000000 4.358899 3.464102 3.316625 7.000000 5.477226 5.656854 4.358899
[25] 3.464102
attr(,"Csingle")
[1] TRUE

Slot "n":
[1] 5

Slot "p":
[1] 5

Slot ".S3Class":
[1] "pdist"
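
The "dist" slot printed above is the 5 x 5 matrix laid out row by row. As a rough cross-check that needs neither package, the sketch below rebuilds the same matrix with base R's dist() and shows the row-major layout. The names m1, m2, full, cross, flat and rebuilt are made up for this illustration. I believe the pdist package also provides an as.matrix() method for its result, but check its documentation rather than taking that from here.

```r
# Data from the question
df1 <- data.frame(a = c(1,2,3,4,5), b = c(5,4,3,2,1), c = c(5,4,1,2,3))
df2 <- data.frame(a2 = c(2,7,1,2,3), b2 = c(7,6,5,4,3), c2 = c(1,2,3,4,5))

# Drop the column names so the two sets of rows can be stacked
m1 <- unname(as.matrix(df1))
m2 <- unname(as.matrix(df2))

# dist() returns all pairwise distances among the stacked rows;
# keep only the block "rows of df1" x "rows of df2"
full  <- as.matrix(dist(rbind(m1, m2)))
cross <- unname(full[seq_len(nrow(m1)), nrow(m1) + seq_len(nrow(m2))])

# Flattening row by row reproduces the ordering of the "dist" slot
flat <- as.vector(t(cross))

# Going back from such a flat vector to a matrix needs byrow = TRUE
rebuilt <- matrix(flat, nrow = nrow(m1), byrow = TRUE)
```

Note that dist() also computes the distances within df1 and within df2, so this is wasteful for large inputs and is meant only as a sanity check.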

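NOTE: If you instead want the distances between the original vectors (a, b, c versus a2, b2, c2), that is, between the columns of the data frames, bind the vectors as rows with rbind: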

a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df1 <- rbind(a, b, c)

a2 <- c(2,7,1,2,3)
b2 <- c(7,6,5,4,3)
c2 <- c(1,2,3,4,5)
df2 <- rbind(a2,b2,c2)

rdist(df1, df2)

This gives:

> rdist(df1, df2)
         [,1]     [,2]     [,3]
[1,] 6.164414 7.745967 0.000000
[2,] 5.099020 4.472136 6.324555
[3,] 4.242641 5.291503 5.656854

Upvotes: 8

aichao

Reputation: 7435

This is adapted from my previous answer here.

For general n-dimensional Euclidean distance, we can exploit the equation (not R, but algebra):

square_dist(b,a) = sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) - 2*inner_prod(b,a)

where the sums are over the dimensions of vectors a and b, for i = 1, ..., n. Here, a and b are one pair of rows from df1 and df2, respectively. The key is that this equation can be written as a matrix equation covering all pairs of rows in df1 and df2 at once.
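
As a sketch of that matrix form, with symbols introduced here only for illustration: let A (p x n) and B (q x n) hold the rows of df1 and df2, let r_A and r_B be the vectors of row-wise sums of squares, and let 1_p and 1_q be vectors of ones. Then the squared distance for a single pair and for all pairs would read:

```latex
\[
D^{2}_{jk} \;=\; \sum_{i=1}^{n} A_{ji}^{2} \;+\; \sum_{i=1}^{n} B_{ki}^{2}
           \;-\; 2 \sum_{i=1}^{n} A_{ji} B_{ki},
\qquad
D^{2} \;=\; r_A \mathbf{1}_q^{\top} \;+\; \mathbf{1}_p\, r_B^{\top} \;-\; 2\, A B^{\top}.
\]
```

The first two terms of the matrix form are the "sum of squares for every pair" part, and the last term is the inner product of every pair.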

In code:

d <- sqrt(matrix(rowSums(expand.grid(rowSums(df1*df1),rowSums(df2*df2))),
                 nrow=nrow(df1)) - 
          2. * as.matrix(df1) %*% t(as.matrix(df2)))

Notes:

  1. The inner rowSums compute sum_i(a[i]*a[i]) and sum_i(b[i]*b[i]) for each a in df1 and b in df2, respectively.
  2. expand.grid then generates all pairs between df1 and df2.
  3. The outer rowSums computes the sum_i(a[i]*a[i]) + sum_i(b[i]*b[i]) for all these pairs.
  4. This result is then reshaped into a matrix. Note that the number of rows of this matrix is the number of rows of df1.
  5. Then subtract two times the inner product of all pairs. This inner product can be written as a matrix multiply df1 %*% t(df2) where I left out the coercion to matrix for clarity.
  6. Finally, take the square root.

Using this code with your data:

print(d)
##         [,1]     [,2]     [,3]     [,4]     [,5]
##[1,] 4.582576 6.782330 2.000000 1.732051 2.828427
##[2,] 4.242641 5.744563 1.732051 0.000000 1.732051
##[3,] 4.123106 5.099020 3.464102 3.316625 4.000000
##[4,] 5.477226 5.000000 4.358899 3.464102 3.316625
##[5,] 7.000000 5.477226 5.656854 4.358899 3.464102

Note that this code will work for any n > 1. In your case, n=3.
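
The steps in the notes above can also be written out one at a time. The sketch below is a variation, not the exact code above: it uses outer() in place of expand.grid plus reshaping, tcrossprod() for the matrix product, and pmax() to clamp tiny negative values that floating-point rounding could otherwise turn into NaN under sqrt. The function name cross_dist is hypothetical.

```r
# Data from the question
df1 <- data.frame(a = c(1,2,3,4,5), b = c(5,4,3,2,1), c = c(5,4,1,2,3))
df2 <- data.frame(a2 = c(2,7,1,2,3), b2 = c(7,6,5,4,3), c2 = c(1,2,3,4,5))

# cross_dist is a made-up helper name for this illustration
cross_dist <- function(x, y) {
  A <- as.matrix(x)
  B <- as.matrix(y)
  stopifnot(ncol(A) == ncol(B))       # both inputs need the same dimension n

  sq_a <- rowSums(A * A)              # sum_i a[i]^2 for each row of x
  sq_b <- rowSums(B * B)              # sum_i b[i]^2 for each row of y

  # sq_a[j] + sq_b[k] for every pair (j, k); rows index x, columns index y
  sq_sum <- outer(sq_a, sq_b, "+")

  # subtract twice the inner product of every pair: A %*% t(B)
  sq_dist <- sq_sum - 2 * tcrossprod(A, B)

  # clamp rounding noise below zero before taking the square root
  sqrt(pmax(sq_dist, 0))
}

d <- cross_dist(df1, df2)
print(d)
```

With the question's data, d[2,1] is the distance between df1[2,] and df2[1,], as requested.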

Upvotes: 2
