user6376297
user6376297

Reputation: 647

unexpected result from cmdscale in R

I'm afraid I'm missing something obvious, but I just can't see what I am doing wrong.
If anyone can help me find it, please, it would be great.

Here's the full, symmetrical distance matrix I'm starting from:

d2 <- structure(list(P1 = c(0, 0.1, 0.3, 0.2, 0, 0.1), P2 = c(0.1, 
0, 0.5, 0.7, 1, 0.9), P3 = c(0.3, 0.5, 0, 1, 0.2, 0.3), P4 = c(0.2, 
0.7, 1, 0, 0.2, 0.5), P5 = c(0, 1, 0.2, 0.2, 0, 0.7), P6 = c(0.1, 
0.9, 0.3, 0.5, 0.7, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))

sum(abs(d2-t(d2)))
#[1] 0

I want to generate coordinates for the corresponding 6 points, so that the (euclidean) distance matrix resulting from those coordinates is as close as possible to my d2.

From the cmdscale documentation:

A set of Euclidean distances on n points can be represented exactly in at most n - 1 dimensions.

I would have thought (n-1)/2 dimensions would suffice, and indeed, when I run cmdscale, if I go anywhere higher than k=3 I get something close to 0 for the higher coordinates, or even error messages:

cmdscale(d2,k=3)
#            [,1]        [,2]          [,3]
#[1,] -0.03526127  0.07755701  1.708755e-05
#[2,] -0.50626939  0.31256816 -5.646907e-02
#[3,] -0.26333957 -0.40518119 -6.978213e-02
#[4,]  0.35902238  0.37455879  2.148406e-02
#[5,]  0.33997864 -0.17998635 -2.809260e-01
#[6,]  0.10586921 -0.17951643  3.856760e-01

cmdscale(d2,k=4)
#            [,1]        [,2]          [,3]          [,4]
#[1,] -0.03526127  0.07755701  1.708755e-05 -7.450581e-09
#[2,] -0.50626939  0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,]  0.35902238  0.37455879  2.148406e-02 -7.450581e-09
#[5,]  0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,]  0.10586921 -0.17951643  3.856760e-01 -7.450581e-09

cmdscale(d2,k=5)
#            [,1]        [,2]          [,3]          [,4]
#[1,] -0.03526127  0.07755701  1.708755e-05 -7.450581e-09
#[2,] -0.50626939  0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,]  0.35902238  0.37455879  2.148406e-02 -7.450581e-09
#[5,]  0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,]  0.10586921 -0.17951643  3.856760e-01 -7.450581e-09
#Warning message:
#In cmdscale(d2, k = 5) : only 4 of the first 5 eigenvalues are > 0

So, assuming that k=3 is sufficient, this is what happens when I try to reverse the operation:

dd <- dist(cmdscale(d2,k=3),diag = T,upper = T)
dd
#          1         2         3         4         5         6
#1 0.0000000 0.5294049 0.5384495 0.4940956 0.5348482 0.4844970
#2 0.5294049 0.0000000 0.7578630 0.8710048 1.0045529 0.9013064
#3 0.5384495 0.7578630 0.0000000 1.0018275 0.6777074 0.6282371
#4 0.4940956 0.8710048 1.0018275 0.0000000 0.6319294 0.7097335
#5 0.5348482 1.0045529 0.6777074 0.6319294 0.0000000 0.7065166
#6 0.4844970 0.9013064 0.6282371 0.7097335 0.7065166 0.0000000

Which is quite different from what I expected:

as.matrix(dd)-d2
#         P1          P2          P3          P4          P5          P6
#1 0.0000000 0.429404930 0.238449457 0.294095619 0.534848178 0.384497043
#2 0.4294049 0.000000000 0.257862963 0.171004810 0.004552925 0.001306386
#3 0.2384495 0.257862963 0.000000000 0.001827507 0.477707386 0.328237091
#4 0.2940956 0.171004810 0.001827507 0.000000000 0.431929428 0.209733518
#5 0.5348482 0.004552925 0.477707386 0.431929428 0.000000000 0.006516573
#6 0.3844970 0.001306386 0.328237091 0.209733518 0.006516573 0.000000000

sum(abs(as.matrix(dd)-d2))
#[1] 7.543948

Has anyone got any idea why the two distance matrices don't match at all?

I could try building my own least squares problem to find the coordinates, but first I need to understand if I'm doing something wrong with these out of the box R functionalities.

Thanks!


EDIT possible inconsistency in the data found

Could the issue be that according to d2 points 1 and 5 coincide (they have distance 0):

as.matrix(d2)
#      P1  P2  P3  P4  P5  P6
#[1,] 0.0 0.1 0.3 0.2 0.0 0.1
#[2,] 0.1 0.0 0.5 0.7 1.0 0.9
#[3,] 0.3 0.5 0.0 1.0 0.2 0.3
#[4,] 0.2 0.7 1.0 0.0 0.2 0.5
#[5,] 0.0 1.0 0.2 0.2 0.0 0.7
#[6,] 0.1 0.9 0.3 0.5 0.7 0.0

but then these two points have different distances from other points, e.g. d(1-2) is 0.1 whereas d(5-2) is 1?

Replacing the two 0's does not seem to help though:

d3 <- d2
d3[1,5] <- 0.2
d3[5,1] <- 0.2
dd3 <- cmdscale(as.matrix(d3),k=3)
sum(abs(as.matrix(dist(dd3))-as.matrix(d3)))
#[1] 7.168348

Does this perhaps indicate that not all distance matrices can be reduced to a completely consistent set of points, regardless of how many dimensions one uses?


EDIT 2 possible answer to the last question.

I suspect that the answer is yes. And I was wrong on the number of dimensions, I see now why you need N-1 rather than half that.

If I have a distance d(A-B) = 1, I can represent that in 2-1 = 1 dimensions (x axis), i.e. on a line, placing A in (xA=0) and B in (xB=1).

Then I introduce a third point C and I state that d(A-C) = 2.
I have 3 points, so I need 3-1 = 2 dimensions (xy plane).
The constraint given by d(A-C) is:

(xC - 0)^2 + (yC - 0)^2 = d(A-C)^2 = 4.

i.e. C can be anywhere on a circumference of radius 2 centred in A.
This constrains both xC and yC to be in [-2,2].

However, previously I had not considered that this constrains the possible values of d(B-C), too, because:

d(B-C)^2 = (xC - 1)^2 + (yC - 0)^2

thus, by substitution of the (yC - 0)^2 term:

d(B-C)^2 = (xC - 1)^2 + 4 - (xC - 0)^2 = -2*xC + 5

d(B-C)^2 is therefore bound to [-2*(+2)+5,-2*(-2)+5] = [1,9].

So if my distance matrix contained d(A-B) = 1, d(A-C) = 2 and d(B-C) anywhere outside [1,3], it would configure a system that does not correspond to 3 points in Euclidean space.

At least, I hope this makes sense.

So I guess my original question must be revoked.
I thought I'd leave the reasoning here for future reference or if anyone else should have the same doubt.

Upvotes: 0

Views: 1868

Answers (1)

dcarlson
dcarlson

Reputation: 11076

Multidimensional scaling creates coordinates for the specified number of dimensions such that they will represent the distances in the original matrix as closely as possible. But the distances will be at different scales. In your example, d3 is the original distance matrix, dd3 is the matrix of coordinates, and dist(dd3) is the distance matrix from the reconstructed coordinates. The values are different, but they reflect the same relationships between points:

d3.v <- as.vector(as.dist(d3))  # Vector of original distances
dd3.v <- as.vector(dist(dd3))   # Vector of distances computed from coordinates
cor(d3.v, dd3.v)
# [1] 0.9433903
plot(d3.v, dd3.v, pch=16)

Plot

Upvotes: 0

Related Questions