Reputation: 355
I'd like to create a matrix which contains the euclidean distances of the rows from one data frame versus the rows from another. For example, say I have the following data frames:
a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df1 <- data.frame(a,b,c)
a2 <- c(2,7,1,2,3)
b2 <- c(7,6,5,4,3)
c2 <- c(1,2,3,4,5)
df2 <- data.frame(a2,b2,c2)
I would like to create a matrix with the distances of each row in df1 versus the rows of df2.
So matrix[2,1] should be the euclidean distance between df1[2,] and df2[1,], matrix[3,2] the distance between df1[3,] and df2[2,], etc.
Does anyone know how this could be achieved?
Upvotes: 5
Views: 3735
Reputation: 156
Perhaps you could use the fields package: the function rdist might do what you want:
rdist : Euclidean distance matrix
Description: Given two sets of locations computes the Euclidean distance matrix among all pairings.
> rdist(df1, df2)
[,1] [,2] [,3] [,4] [,5]
[1,] 4.582576 6.782330 2.000000 1.732051 2.828427
[2,] 4.242641 5.744563 1.732051 0.000000 1.732051
[3,] 4.123106 5.099020 3.464102 3.316625 4.000000
[4,] 5.477226 5.000000 4.358899 3.464102 3.316625
[5,] 7.000000 5.477226 5.656854 4.358899 3.464102
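If installing an extra package is not an option, the same matrix can be cross-checked with base R's dist(). This is only a sketch and not part of the fields package: it stacks both sets of rows, computes all pairwise distances, and keeps the df1-versus-df2 block. The names m1, m2, all_d, and cross are made up for illustration.

```r
# Same data as in the question
df1 <- data.frame(a = c(1,2,3,4,5), b = c(5,4,3,2,1), c = c(5,4,1,2,3))
df2 <- data.frame(a2 = c(2,7,1,2,3), b2 = c(7,6,5,4,3), c2 = c(1,2,3,4,5))

# Convert to matrices first: rbind() on the data frames would fail
# because their column names differ
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# dist() returns the distances among ALL rows of the stacked matrix
all_d <- as.matrix(dist(rbind(m1, m2)))

# Keep only the rows that belong to df1 and the columns that belong to df2
cross <- unname(all_d[seq_len(nrow(m1)), nrow(m1) + seq_len(nrow(m2))])

round(cross[2, 1], 6)  # distance between df1[2,] and df2[1,]: 4.242641
```

Note that this also computes the df1-versus-df1 and df2-versus-df2 distances and then throws them away, so it is probably only reasonable for small inputs.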
The pdist package offers something similar:
pdist : Distances between Observations for a Partitioned Matrix
Description: Computes the euclidean distance between rows of a matrix X and rows of another matrix Y.
> pdist(df1, df2)
An object of class "pdist"
Slot "dist":
[1] 4.582576 6.782330 2.000000 1.732051 2.828427 4.242640 5.744563 1.732051
[9] 0.000000 1.732051 4.123106 5.099020 3.464102 3.316625 4.000000 5.477226
[17] 5.000000 4.358899 3.464102 3.316625 7.000000 5.477226 5.656854 4.358899
[25] 3.464102
attr(,"Csingle")
[1] TRUE
Slot "n":
[1] 5
Slot "p":
[1] 5
Slot ".S3Class":
[1] "pdist"
NOTE: If you instead want the distances between the vectors a, b, c and a2, b2, c2 themselves (the columns of the data frames above), bind them as rows with rbind:
a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df1 <- rbind(a, b, c)
a2 <- c(2,7,1,2,3)
b2 <- c(7,6,5,4,3)
c2 <- c(1,2,3,4,5)
df2 <- rbind(a2,b2,c2)
rdist(df1, df2)
This gives:
> rdist(df1, df2)
[,1] [,2] [,3]
[1,] 6.164414 7.745967 0.000000
[2,] 5.099020 4.472136 6.324555
[3,] 4.242641 5.291503 5.656854
Upvotes: 8
Reputation: 7435
This is adapted from my previous answer here.
For general n-dimensional Euclidean distance, we can exploit the equation (not R, but algebra):
square_dist(b,a) = sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) - 2*inner_prod(b,a)
where the sums are over the dimensions of vectors a and b for i=[1,n]. Here, a and b are one pair of rows from df1 and df2, respectively. The key here is that this equation can be written as a matrix equation for all pairs in df1 and df2.
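Written out, the identity and its matrix form look as follows. The symbols A, B, r, s, and D are introduced here for illustration only: A and B stand for df1 and df2 as matrices, r and s hold their squared row norms, and D is the resulting distance matrix.

```latex
\begin{aligned}
\lVert a - b \rVert^2 &= \sum_{i=1}^{n} (a_i - b_i)^2
  = \sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - 2 \sum_{i=1}^{n} a_i b_i \\
D_{jk}^2 &= r_j + s_k - 2\,(A B^{\top})_{jk},
  \qquad r_j = \sum_{i=1}^{n} A_{ji}^2, \quad s_k = \sum_{i=1}^{n} B_{ki}^2
\end{aligned}
```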
In code:
# squared row norms for every (df1 row, df2 row) pair, minus twice the inner products
d <- sqrt(matrix(rowSums(expand.grid(rowSums(df1*df1), rowSums(df2*df2))),
                 nrow=nrow(df1)) -
          2 * as.matrix(df1) %*% t(as.matrix(df2)))
Notes:
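1. The inner two rowSums compute sum_i(a[i]*a[i]) and sum_i(b[i]*b[i]) for each a in df1 and b in df2, respectively.
2. expand.grid then generates all pairs between df1 and df2.
3. The outer rowSums computes the sum_i(a[i]*a[i]) + sum_i(b[i]*b[i]) for all these pairs.
4. The result is reshaped into a matrix. Note that the number of rows of this matrix is the number of rows of df1.
5. Twice the inner product of all pairs is then subtracted. This inner product can be written as the matrix product df1 %*% t(df2), where I left out the coercion to matrix for clarity.
6. Finally, sqrt takes the square root.

Using this code with your data: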
print(d)
## [,1] [,2] [,3] [,4] [,5]
##[1,] 4.582576 6.782330 2.000000 1.732051 2.828427
##[2,] 4.242641 5.744563 1.732051 0.000000 1.732051
##[3,] 4.123106 5.099020 3.464102 3.316625 4.000000
##[4,] 5.477226 5.000000 4.358899 3.464102 3.316625
##[5,] 7.000000 5.477226 5.656854 4.358899 3.464102
Note that this code will work for any n > 1. In your case, n=3.
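The same steps can also be sketched with outer() and tcrossprod(), which some may find easier to read than the expand.grid version. This is an alternative formulation, not the code from the answer above; the names A, B, sq, d2, and ref are illustrative. The pmax() guard is an added assumption: with non-integer data, rounding can leave tiny negative values where the true distance is zero, and sqrt() would then return NaN.

```r
# Same data as in the question
df1 <- data.frame(a = c(1,2,3,4,5), b = c(5,4,3,2,1), c = c(5,4,1,2,3))
df2 <- data.frame(a2 = c(2,7,1,2,3), b2 = c(7,6,5,4,3), c2 = c(1,2,3,4,5))
A <- as.matrix(df1)
B <- as.matrix(df2)

# sq[i, j] = sum(A[i, ]^2) + sum(B[j, ]^2)
sq <- outer(rowSums(A * A), rowSums(B * B), "+")

# Subtract twice the inner products; tcrossprod(A, B) equals A %*% t(B)
d2 <- sq - 2 * tcrossprod(A, B)

# Clamp at zero before the square root to avoid NaN from rounding error
d <- sqrt(pmax(d2, 0))

# Brute-force reference, computed pair by pair, for comparison
ref <- matrix(0, nrow(A), nrow(B))
for (i in seq_len(nrow(A))) {
  for (j in seq_len(nrow(B))) {
    ref[i, j] <- sqrt(sum((A[i, ] - B[j, ])^2))
  }
}

all.equal(unname(d), ref)
```

The explicit loop is only there as a check; for large inputs the matrix form is the part worth keeping.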
Upvotes: 2