Reputation: 3339
EDIT: I think, from my discussion below with @joran, that @joran helped me figure out how dist is altering the distance value (it appears to scale the sum of the squared differences over the non-missing coordinates by [total dimensions]/[non-missing dimensions], but that is just a guess). What I'd like to know, if anyone does know, is: is that really what is going on? If so, why is that considered a reasonable thing to do? And could there, or should there, be an option for dist to compute it the way I proposed (though that question might be too vague or too opinion-based to answer)?
I was wondering how the dist function actually works on vectors that have missing values. Below is a recreated example. I use the dist function and a more fundamental implementation of what I believe should be the definition of Euclidean distance, built from sqrt, sum, and powers. I also expected that if a component of either vector was NA, that dimension would simply be dropped from the sum, which is how I implemented it. But as you can see, that definition doesn't agree with dist.
I will be using my basic implementation to handle the NA values, but I was wondering how dist actually arrives at a value when vectors contain NA, and why it doesn't agree with how I calculate it below. I would have thought my basic implementation would be the default/common one, and I can't figure out what alternate method dist is using to get what it is getting.
Thanks, Matt
v1 <- c(1,1,1)
v2 <- c(1,2,3)
v3 <- c(1,NA,3)
# Agree on vectors with non-missing components
# --------------------------------------------
dist(rbind(v1, v2))
# v1
# v2 2.236068
sqrt(sum((v1 - v2)^2, na.rm=TRUE))
# [1] 2.236068
# But they don't agree when there is a missing component
# Under what logic does sqrt(6) make sense as the answer for dist?
# --------------------------------------------
dist(rbind(v1, v3))
# v1
# v3 2.44949
sqrt(sum((v1 - v3)^2, na.rm=TRUE))
# [1] 2
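The guess from my EDIT can be checked directly on this small example: sum the squared differences over the non-missing dimensions, then scale by (total dimensions)/(non-missing dimensions). That reproduces the sqrt(6) that dist returns:

```r
v1 <- c(1, 1, 1)
v3 <- c(1, NA, 3)
ss   <- sum((v1 - v3)^2, na.rm = TRUE)  # 4, from the two observed dimensions
p    <- length(v1)                      # 3 total dimensions
used <- sum(!is.na(v1 - v3))            # 2 non-missing dimensions
sqrt(ss * p / used)
# [1] 2.44949  (sqrt(6), matching dist)
```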
Upvotes: 6
Views: 2739
Reputation: 89097
Yes, the scaling happens exactly like you described. Maybe this is a better example:
set.seed(123)
v1 <- sample(c(1:3, NA), 100, TRUE)
v2 <- sample(c(1:3, NA), 100, TRUE)
dist(rbind(v1, v2))
# v1
# v2 12.24745
na.idx <- is.na(v1) | is.na(v2)
v1a <- v1[!na.idx]
v2a <- v2[!na.idx]
sqrt(sum((v1a - v2a)^2) * length(v1) / length(v1a))
# [1] 12.24745
The scaling makes sense to me. All else being equal, the distance increases as the number of dimensions increases. If you have an NA for dimension i, a reasonable guess for the contribution of dimension i to the squared sum is the mean contribution of all the other dimensions. Hence the linear up-scaling.
You, on the other hand, are suggesting that when you find an NA for dimension i, that dimension should not contribute to the squared sum at all. That is like assuming that v1[i] == v2[i], which is totally different.
To summarize: dist is doing some type of maximum-likelihood estimation, while your suggestion is more like a worst- (or best-) case scenario.
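If you do want the drop-the-dimension behavior from the question, it is easy to wrap yourself. Here is a minimal sketch (euclid_complete is a made-up name, not part of base R) that simply ignores NA dimensions instead of up-scaling:

```r
# Hypothetical helper: Euclidean distance over complete dimensions only,
# i.e. NA dimensions contribute nothing to the sum (no up-scaling).
euclid_complete <- function(x, y) {
  sqrt(sum((x - y)^2, na.rm = TRUE))
}

euclid_complete(c(1, 1, 1), c(1, NA, 3))
# [1] 2  (versus dist's sqrt(6) = 2.44949)
```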
Upvotes: 10