Reputation: 2022
Given the matrix
structure(list(X1 = c(1L, 2L, 3L, 4L, 2L, 5L), X2 = c(2L, 3L,
4L, 5L, 3L, 6L), X3 = c(3L, 4L, 4L, 5L, 3L, 2L), X4 = c(2L, 4L,
6L, 5L, 3L, 8L), X5 = c(1L, 3L, 2L, 4L, 6L, 4L)), .Names = c("X1",
"X2", "X3", "X4", "X5"), class = "data.frame", row.names = c(NA,
-6L))
I want to create a 5 x 5 distance matrix between all columns, where each entry is the ratio of matches to the total number of rows. For instance, the entry for X4 and X3 should be 0.5, given that the two columns match on 3 of their 6 rows.
I have tried using dist(test, method = "simple matching") from the "proxy" package, but that method only works for binary data.
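For reference, the expected X4/X3 value can be checked directly in base R (columns copied from the data above):

```r
# X3 and X4 match on rows 2, 4 and 5, so the ratio is 3/6 = 0.5.
X3 <- c(3, 4, 4, 5, 3, 2)
X4 <- c(2, 4, 6, 5, 3, 8)
sum(X3 == X4) / length(X3)  # 0.5
```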
Upvotes: 1
Views: 2115
Reputation: 2022
Thank you all for your suggestions. Based on your answers I put together a three-line solution ("test" is the name of the dataset):
require(proxy)
ff <- function(x,y) sum(x == y) / NROW(x)
dist(t(test), ff, upper=TRUE)
Here is the output:
          X1        X2        X3        X4        X5
X1           0.0000000 0.0000000 0.0000000 0.3333333
X2 0.0000000           0.5000000 0.5000000 0.1666667
X3 0.0000000 0.5000000           0.5000000 0.0000000
X4 0.0000000 0.5000000 0.5000000           0.0000000
X5 0.3333333 0.1666667 0.0000000 0.0000000
Upvotes: 1
Reputation: 11
I got the answer as follows. First, I made some modifications to the raw data:
X1 = c(1L, 2L, 3L, 4L, 2L, 5L)
X2 = c(2L, 3L, 4L, 5L, 3L, 6L)
X3 = c(3L, 4L, 4L, 5L, 3L, 2L)
X4 = c(2L, 4L, 6L, 5L, 3L, 8L)
X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
matrix_cor <- rbind(X1, X2, X3, X4, X5)
matrix_cor
[,1] [,2] [,3] [,4] [,5] [,6]
X1 1 2 3 4 2 5
X2 2 3 4 5 3 6
X3 3 4 4 5 3 2
X4 2 4 6 5 3 8
X5 1 3 2 4 6 4
then:
dist(matrix_cor)
X1 X2 X3 X4
X2 2.449490
X3 4.472136 4.242641
X4 5.000000 3.000000 6.403124
X5 4.358899 4.358899 4.795832 6.633250
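Note that dist() with no method argument uses Euclidean distance, so each entry above is a Euclidean distance; the X1-X2 entry, for example, can be checked by hand in base R:

```r
# Euclidean distance between X1 and X2: every element differs by 1,
# so the distance is sqrt(6), about 2.449490.
X1 <- c(1, 2, 3, 4, 2, 5)
X2 <- c(2, 3, 4, 5, 3, 6)
sqrt(sum((X1 - X2)^2))  # 2.449490
```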
Upvotes: 1
Reputation: 89057
Using outer (again :-)
my.dist <- function(x) {
  n <- nrow(x)
  d <- outer(seq.int(ncol(x)), seq.int(ncol(x)),
             Vectorize(function(i, j) sum(x[[i]] == x[[j]]) / n))
  rownames(d) <- names(x)
  colnames(d) <- names(x)
  return(d)
}
my.dist(test)
# X1 X2 X3 X4 X5
# X1 1.0000000 0.0000000 0.0 0.0 0.3333333
# X2 0.0000000 1.0000000 0.5 0.5 0.1666667
# X3 0.0000000 0.5000000 1.0 0.5 0.0000000
# X4 0.0000000 0.5000000 0.5 1.0 0.0000000
# X5 0.3333333 0.1666667 0.0 0.0 1.0000000
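A side note on the Vectorize() wrapper above: outer() calls its function once with the fully expanded index vectors, not once per element, so a scalar-valued function must be vectorized first. A minimal illustration (not part of the original answer):

```r
# f is scalar-valued: it collapses its inputs with sum(), so outer()
# would fail without the Vectorize() wrapper.
f <- function(i, j) sum((1:3)[i] == (3:1)[j])
outer(1:3, 1:3, Vectorize(f))  # 1 on the anti-diagonal, 0 elsewhere
```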
Upvotes: 6
Reputation: 69151
Here's a solution that is faster than the other two, though a bit ugly. I assume the speed gains come from not using mean(), which can be slow compared to sum(), and from computing only half of the output matrix and then filling the lower triangle manually. The function currently leaves NA on the diagonal, but you can set those to one with diag(out) <- 1 to completely match the other answers.
FUN <- function(m) {
  # compute all the combinations of column pairs
  combos <- t(combn(ncol(m), 2))
  # compute the similarity index based on the criteria defined
  sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0) / nrow(m))
  combos <- cbind(combos, sim)
  # dimensions of the output matrix
  out <- matrix(NA, ncol = ncol(m), nrow = ncol(m))
  for (i in 1:nrow(combos)) {
    # upper triangle
    out[combos[i, 1], combos[i, 2]] <- combos[i, 3]
    # lower triangle
    out[combos[i, 2], combos[i, 1]] <- combos[i, 3]
  }
  return(out)
}
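To see what FUN computes internally, the combos table and sim vector can be inspected on the question's data (a sketch; the data frame is rebuilt here so the snippet runs on its own):

```r
# The question's data frame, repeated for a self-contained example.
test <- data.frame(X1 = c(1, 2, 3, 4, 2, 5), X2 = c(2, 3, 4, 5, 3, 6),
                   X3 = c(3, 4, 4, 5, 3, 2), X4 = c(2, 4, 6, 5, 3, 8),
                   X5 = c(1, 3, 2, 4, 6, 4))
m <- as.matrix(test)
combos <- t(combn(ncol(m), 2))  # the 10 unordered column pairs
sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0) / nrow(m))
cbind(combos, sim)              # pair indices plus their match ratio
```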
I took the other two answers, made them into functions, and did some benchmarking:
library(rbenchmark)
benchmark(chase(m), flodel(m), blindJessie(m),
          replications = 1000,
          order = "elapsed",
          columns = c("test", "elapsed", "relative"))
#-----
test elapsed relative
1 chase(m) 1.217 1.000000
2 flodel(m) 1.306 1.073131
3 blindJessie(m) 17.691 14.548520
Upvotes: 2
Reputation: 4603
Here's a shot at it (dt is your data frame):
library(reshape)
df <- expand.grid(names(dt), names(dt))
df$val <- apply(df, 1, function(x) mean(dt[x[1]] == dt[x[2]]))
cast(df, Var2 ~ Var1)
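The reshape dependency is optional here: an equivalent base-R sketch with nested sapply calls (assuming dt is the question's data frame, rebuilt below) produces the same ratios as a named matrix:

```r
# The question's data frame, repeated for a self-contained example.
dt <- data.frame(X1 = c(1, 2, 3, 4, 2, 5), X2 = c(2, 3, 4, 5, 3, 6),
                 X3 = c(3, 4, 4, 5, 3, 2), X4 = c(2, 4, 6, 5, 3, 8),
                 X5 = c(1, 3, 2, 4, 6, 4))
sim <- sapply(dt, function(a) sapply(dt, function(b) mean(a == b)))
sim["X3", "X4"]  # 0.5
```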
Upvotes: 2