unexpected result from cmdscale in R

Question

I'm afraid I'm missing something obvious, but I just can't see what I am doing wrong.
If anyone can help me find it, please, it would be great.

Here's the full, symmetrical distance matrix I'm starting from:

d2 <- structure(list(P1 = c(0, 0.1, 0.3, 0.2, 0, 0.1), P2 = c(0.1, 
0, 0.5, 0.7, 1, 0.9), P3 = c(0.3, 0.5, 0, 1, 0.2, 0.3), P4 = c(0.2, 
0.7, 1, 0, 0.2, 0.5), P5 = c(0, 1, 0.2, 0.2, 0, 0.7), P6 = c(0.1, 
0.9, 0.3, 0.5, 0.7, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))

sum(abs(d2-t(d2)))
#[1] 0

I want to generate coordinates for the corresponding 6 points, so that the (euclidean) distance matrix resulting from those coordinates is as close as possible to my d2.

From the cmdscale documentation:

A set of Euclidean distances on n points can be represented exactly in at most n - 1 dimensions.

I would have thought (n-1)/2 dimensions would suffice, and indeed, when I run cmdscale, if I go anywhere higher than k=3 I get something close to 0 for the higher coordinates, or even error messages:

cmdscale(d2,k=3)
#            [,1]        [,2]          [,3]
#[1,] -0.03526127  0.07755701  1.708755e-05
#[2,] -0.50626939  0.31256816 -5.646907e-02
#[3,] -0.26333957 -0.40518119 -6.978213e-02
#[4,]  0.35902238  0.37455879  2.148406e-02
#[5,]  0.33997864 -0.17998635 -2.809260e-01
#[6,]  0.10586921 -0.17951643  3.856760e-01

cmdscale(d2,k=4)
#            [,1]        [,2]          [,3]          [,4]
#[1,] -0.03526127  0.07755701  1.708755e-05 -7.450581e-09
#[2,] -0.50626939  0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,]  0.35902238  0.37455879  2.148406e-02 -7.450581e-09
#[5,]  0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,]  0.10586921 -0.17951643  3.856760e-01 -7.450581e-09

cmdscale(d2,k=5)
#            [,1]        [,2]          [,3]          [,4]
#[1,] -0.03526127  0.07755701  1.708755e-05 -7.450581e-09
#[2,] -0.50626939  0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,]  0.35902238  0.37455879  2.148406e-02 -7.450581e-09
#[5,]  0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,]  0.10586921 -0.17951643  3.856760e-01 -7.450581e-09
#Warning message:
#In cmdscale(d2, k = 5) : only 4 of the first 5 eigenvalues are > 0

So, assuming that k=3 is sufficient, this is what happens when I try to reverse the operation:

dd <- dist(cmdscale(d2,k=3),diag = T,upper = T)
dd
#          1         2         3         4         5         6
#1 0.0000000 0.5294049 0.5384495 0.4940956 0.5348482 0.4844970
#2 0.5294049 0.0000000 0.7578630 0.8710048 1.0045529 0.9013064
#3 0.5384495 0.7578630 0.0000000 1.0018275 0.6777074 0.6282371
#4 0.4940956 0.8710048 1.0018275 0.0000000 0.6319294 0.7097335
#5 0.5348482 1.0045529 0.6777074 0.6319294 0.0000000 0.7065166
#6 0.4844970 0.9013064 0.6282371 0.7097335 0.7065166 0.0000000

Which is quite different from what I expected:

as.matrix(dd)-d2
#         P1          P2          P3          P4          P5          P6
#1 0.0000000 0.429404930 0.238449457 0.294095619 0.534848178 0.384497043
#2 0.4294049 0.000000000 0.257862963 0.171004810 0.004552925 0.001306386
#3 0.2384495 0.257862963 0.000000000 0.001827507 0.477707386 0.328237091
#4 0.2940956 0.171004810 0.001827507 0.000000000 0.431929428 0.209733518
#5 0.5348482 0.004552925 0.477707386 0.431929428 0.000000000 0.006516573
#6 0.3844970 0.001306386 0.328237091 0.209733518 0.006516573 0.000000000

sum(abs(as.matrix(dd)-d2))
#[1] 7.543948

Has anyone got any idea why the two distance matrices don't match at all?

I could try building my own least squares problem to find the coordinates, but first I need to understand if I'm doing something wrong with these out of the box R functionalities.

Thanks!

EDIT possible inconsistency in the data found

Could the issue be that according to d2 points 1 and 5 coincide (they have distance 0):

as.matrix(d2)
#      P1  P2  P3  P4  P5  P6
#[1,] 0.0 0.1 0.3 0.2 0.0 0.1
#[2,] 0.1 0.0 0.5 0.7 1.0 0.9
#[3,] 0.3 0.5 0.0 1.0 0.2 0.3
#[4,] 0.2 0.7 1.0 0.0 0.2 0.5
#[5,] 0.0 1.0 0.2 0.2 0.0 0.7
#[6,] 0.1 0.9 0.3 0.5 0.7 0.0

but then these two points have different distances from other points, e.g. d(1-2) is 0.1 whereas d(5-2) is 1?

Replacing the two 0's does not seem to help though:

d3 <- d2
d3[1,5] <- 0.2
d3[5,1] <- 0.2
dd3 <- cmdscale(as.matrix(d3),k=3)
sum(abs(as.matrix(dist(dd3))-as.matrix(d3)))
#[1] 7.168348

Does this perhaps indicate that not all distance matrices can be reduced to a completely consistent set of points, regardless of how many dimensions one uses?

EDIT 2 possible answer to the last question.

I suspect that the answer is yes. And I was wrong on the number of dimensions, I see now why you need N-1 rather than half that.

If I have a distance d(A-B) = 1, I can represent that in 2-1 = 1 dimensions (x axis), i.e. on a line, placing A in (xA=0) and B in (xB=1).

Then I introduce a third point C and I state that d(A-C) = 2.
I have 3 points, so I need 3-1 = 2 dimensions (xy plane).
The constraint given by d(A-C) is:

(xC - 0)^2 + (yC - 0)^2 = d(A-C)^2 = 4.

i.e. C can be anywhere on a circumference of radius 2 centred in A.
This constrains both xC and yC to be in [-2,2].

However, previously I had not considered that this constrains the possible values of d(B-C), too, because:

d(B-C)^2 = (xC - 1)^2 + (yC - 0)^2

thus, by substitution of the (yC - 0)^2 term:

d(B-C)^2 = (xC - 1)^2 + 4 - (xC - 0)^2 = -2*xC + 5

d(B-C)^2 is therefore bound to [-2*(+2)+5,-2*(-2)+5] = [1,9].

So if my distance matrix contained d(A-B) = 1, d(A-C) = 2 and d(B-C) anywhere outside [1,3], it would configure a system that does not correspond to 3 points in Euclidean space.

At least, I hope this makes sense.

So I guess my original question must be revoked.
I thought I'd leave the reasoning here for future reference or if anyone else should have the same doubt.

unexpected result from cmdscale in R

Answers (1)

Related Questions