Comparing distribution of two vectors

I have 5 different vectors and a reference vector that I want to compare them to. I need to find which of the 5 is most similar to the reference.

The vectors are quite long, so I will show only a small part of each:

# Vector to compare to:
v1 <- c(0.2500, 0.4375, 0.1250, 0.3125, 0.0000, 0.5625, 0.1250, 0.1875, 0.1875, 0.1875, 0.1875)

# One of the vectors to compare
v2 <- c(2, 1, 0, 1, 1, 1, 1, 0, 2, 1, 2)

# Another of the vectors to compare:
v3 <- c(5, 0, 3, 1, 1, 2, 1, 2, 0, 1, 2)

In practice, I need a statistical test that compares the distributions (histograms) of these vectors and tells me which one is closest to the reference. I tried ks.test, but it complained about duplicate values in the vectors, and the p-value it returned was tiny (around 0.0000000000001). Any ideas on how to do this (other than visually)?

Upvotes: 2

Views: 2092

Answers (1)

Ben Bolker

Reputation: 226172

It's not clear to me why you need a statistical test if all you want to do is compute which one is closest. Below I'm just computing the histograms directly and comparing their distances.

Generate data:

v1 <- c(0.2500, 0.4375, 0.1250, 0.3125, 0.0000, 0.5625, 0.1250,
   0.1875, 0.1875, 0.1875, 0.1875)
v2 <- c(2, 1, 0, 1, 1, 1, 1, 0, 2, 1, 2)*0.1
v3 <- c(5, 0, 3, 1, 1, 2, 1, 2, 0, 1, 2)*0.1

Note that I rescaled vectors 2 and 3 (multiplying by 0.1) so that their distributions actually overlap with the comparison vector.

# Bin all vectors on a common set of breaks and keep only the bin counts
vList <- list(v1, v2, v3)
brkvec <- seq(0, 0.7, by = 0.1)
hList <- lapply(vList, function(x)
     hist(x, plot = FALSE, breaks = brkvec)$counts)

This is a little inefficient because dist() computes all of the pairwise distances (Euclidean, by default) and then most of them are thrown away:

dmat <- dist(do.call(rbind, hList))
# Keep only the distances from the reference (row 1) to the other vectors
dvec <- as.matrix(dmat)[-1, 1]
dvec
##        2        3 
## 7.874008 6.000000 
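If the unused pairwise distances are a concern, the distance from each candidate histogram to the reference can be computed directly. This is only a minimal sketch that assumes the same data and breaks as in the answer; the names `counts`, `ref`, `candidates`, and `dvec_direct` are illustrative and not part of the original code.

```r
# Data as in the answer (v2 and v3 already scaled by 0.1)
v1 <- c(0.2500, 0.4375, 0.1250, 0.3125, 0.0000, 0.5625, 0.1250,
        0.1875, 0.1875, 0.1875, 0.1875)
v2 <- c(2, 1, 0, 1, 1, 1, 1, 0, 2, 1, 2) * 0.1
v3 <- c(5, 0, 3, 1, 1, 2, 1, 2, 0, 1, 2) * 0.1

# Illustrative helper: bin counts on a common set of breaks
brkvec <- seq(0, 0.7, by = 0.1)
counts <- function(x) hist(x, plot = FALSE, breaks = brkvec)$counts

ref <- counts(v1)
candidates <- list(v2 = v2, v3 = v3)

# Euclidean distance from each candidate's counts to the reference counts,
# without building the full pairwise distance matrix
dvec_direct <- sapply(candidates, function(x) sqrt(sum((counts(x) - ref)^2)))
dvec_direct
```

This should reproduce the values in dvec, since dist() uses the Euclidean metric unless told otherwise.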

The other option would be to ignore the warning from ks.test(), since it only affects inference (the p-value), not the computation of the distance statistic:

ks.dist <- sapply(vList[-1],
        function(x) suppressWarnings(ks.test(v1,x)$statistic))
ks.dist
##         D         D 
## 0.6363636 0.4545455

The results match: by both measures, v3 is closer to v1 than v2 is.
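Since the question asks for the single most similar vector, the last step can be automated with which.min(). The following is a rough sketch, assuming the KS statistic is an acceptable distance for your purpose; `closest_by_ks` is a hypothetical helper name, not an existing R function.

```r
v1 <- c(0.2500, 0.4375, 0.1250, 0.3125, 0.0000, 0.5625, 0.1250,
        0.1875, 0.1875, 0.1875, 0.1875)
v2 <- c(2, 1, 0, 1, 1, 1, 1, 0, 2, 1, 2) * 0.1
v3 <- c(5, 0, 3, 1, 1, 2, 1, 2, 0, 1, 2) * 0.1

# Hypothetical helper: KS statistic of each candidate against the reference,
# plus the position of the candidate with the smallest statistic
closest_by_ks <- function(ref, candidates) {
  d <- sapply(candidates, function(x)
    unname(suppressWarnings(ks.test(ref, x)$statistic)))
  list(index = which.min(d), distances = d)
}

res <- closest_by_ks(v1, list(v2, v3))
res$index      # position of the closest vector within the candidate list
res$distances  # KS statistics, in the same order as the candidates
```

The same pattern works with the histogram distances: apply which.min() to dvec instead.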

Upvotes: 2
