mpettis

Reputation: 3339

Function `dist` not behaving as expected on vectors with missing values

EDIT: From my discussion below with @joran, I think I now understand how dist is altering the distance value: it appears to scale the sum of the squared coordinate differences by [total dimensions]/[non-missing dimensions], though that is just a guess. What I'd like to know, if anyone knows: Is that what is really going on? If so, why is that considered a reasonable thing to do? And could there, or should there, be an option to dist to compute it the way I proposed? (That last question might be too vague or too opinionated to answer, though.)

I was wondering how the dist function actually works on vectors that have missing values. Below is a recreated example. I compare dist against a more basic implementation of what I believe should be the definition of Euclidean distance, built from sqrt, sum, and powers. I expected that if a component of either vector was NA, that dimension would simply be dropped from the sum, which is how I implemented it. But as you can see, that definition doesn't agree with dist.

I will be using my basic implementation to handle the NA values, but I was wondering how dist is actually arriving at a value when vectors have NA, and why it doesn't agree with how I calculate it below. I would think that my basic implementation should be the default/common one, and I can't figure out what alternate method dist is using to get what it is getting.

Thanks, Matt

v1 <- c(1,1,1)
v2 <- c(1,2,3)
v3 <- c(1,NA,3)

# Agree on vectors with non-missing components
# --------------------------------------------
dist(rbind(v1, v2))
#          v1
# v2 2.236068

sqrt(sum((v1 - v2)^2, na.rm=TRUE))
# [1] 2.236068



# But they don't agree when there is a missing component
# Under what logic does sqrt(6) make sense as the answer for dist?
# --------------------------------------------
dist(rbind(v1, v3))
#         v1
# v3 2.44949

sqrt(sum((v1 - v3)^2, na.rm=TRUE))
# [1] 2

Upvotes: 6

Views: 2739

Answers (1)

flodel

Reputation: 89097

Yes, the scaling happens exactly like you described. Maybe this is a better example:

set.seed(123)
v1 <- sample(c(1:3, NA), 100, TRUE)
v2 <- sample(c(1:3, NA), 100, TRUE)

dist(rbind(v1, v2))
#          v1
# v2 12.24745

na.idx <- is.na(v1) | is.na(v2) 
v1a  <- v1[!na.idx]
v2a  <- v2[!na.idx]

sqrt(sum((v1a - v2a)^2) * length(v1) / length(v1a))
# [1] 12.24745

The scaling makes sense to me. All else being equal, the distance increases as the number of dimensions increases. If you have an NA somewhere in dimension i, a reasonable guess for the contribution of dimension i to the squared sum is the mean contribution of all the other dimensions. Hence the linear up-scaling.
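To make the rule concrete on the question's own vectors, here is a minimal sketch. The helper name `scaled_euclid` is hypothetical (it is not part of base R); it only mirrors the scaling described above, which is also what the help page for `dist` appears to describe when it says the sum is scaled up proportionally to the number of columns used.

```r
# Hypothetical helper, not part of base R: Euclidean distance with the
# proportional up-scaling described above.
scaled_euclid <- function(x, y) {
  ok <- !(is.na(x) | is.na(y))      # dimensions observed in both vectors
  ss <- sum((x[ok] - y[ok])^2)      # squared sum over the observed dimensions
  sqrt(ss * length(x) / sum(ok))    # scale by total / observed dimensions
}

v1 <- c(1, 1, 1)
v3 <- c(1, NA, 3)

# Observed squared sum is 0 + 4 = 4 over 2 of 3 dimensions,
# so the scaled value is sqrt(4 * 3 / 2) = sqrt(6).
scaled_euclid(v1, v3)
as.numeric(dist(rbind(v1, v3)))
```

Equivalently, the scaled sum is the observed sum plus one mean contribution (here 4 / 2 = 2) for each missing dimension: 4 + 2 = 6.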

You, on the other hand, are suggesting that when there is an NA for dimension i, that dimension should contribute nothing to the squared sum. That is like assuming that v1[i] == v2[i], which is a very different assumption.
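A small check of that equivalence, again on the question's vectors. The variable names `dropped`, `v3_filled`, and `filled` are just illustrative; the point is that dropping the NA dimension gives the same number as filling the NA with the other vector's coordinate.

```r
v1 <- c(1, 1, 1)
v3 <- c(1, NA, 3)

# Dropping the missing dimension from the sum ...
dropped <- sqrt(sum((v1 - v3)^2, na.rm = TRUE))

# ... gives the same value as filling the NA with v1's coordinate,
# i.e. assuming the two vectors agree in that dimension.
v3_filled <- ifelse(is.na(v3), v1, v3)
filled <- sqrt(sum((v1 - v3_filled)^2))

c(dropped = dropped, filled = filled)   # both are 2
```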

To summarize, dist is doing a kind of maximum-likelihood estimation, while your suggestion is more like a worst-case (or best-case) scenario.

Upvotes: 10
