Reputation: 2022
Given the matrix
structure(list(X1 = c(1L, 2L, 3L, 4L, 2L, 5L), X2 = c(2L, 3L,
4L, 5L, 3L, 6L), X3 = c(3L, 4L, 4L, 5L, 3L, 2L), X4 = c(2L, 4L,
6L, 5L, 3L, 8L), X5 = c(1L, 3L, 2L, 4L, 6L, 4L)), .Names = c("X1",
"X2", "X3", "X4", "X5"), class = "data.frame", row.names = c(NA,
-6L))
I want to create a 5 x 5 distance matrix between all columns, where each entry is the ratio of matches to the total number of rows. For instance, the entry for X4 and X3 should be 0.5, given that the two columns match on 3 of their 6 rows.
I have tried using dist(test, method = "simple matching") from the "proxy" package, but that method only works for binary data.
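For reference, the expected X4/X3 value can be checked directly in base R (columns copied from the data above):

```r
# X3 and X4 match on rows 2, 4 and 5, so the ratio is 3/6 = 0.5.
X3 <- c(3, 4, 4, 5, 3, 2)
X4 <- c(2, 4, 6, 5, 3, 8)
sum(X3 == X4) / length(X3)  # 0.5
```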
Upvotes: 1
Views: 2115
Reputation: 2022
Thank you all for your suggestions. Based on your answers I put together a three-line solution ("test" is the name of the dataset):
require(proxy)
ff <- function(x,y) sum(x == y) / NROW(x)
dist(t(test), ff, upper=TRUE)
Here is the output:
          X1        X2        X3        X4        X5
X1           0.0000000 0.0000000 0.0000000 0.3333333
X2 0.0000000           0.5000000 0.5000000 0.1666667
X3 0.0000000 0.5000000           0.5000000 0.0000000
X4 0.0000000 0.5000000 0.5000000           0.0000000
X5 0.3333333 0.1666667 0.0000000 0.0000000
Upvotes: 1
Reputation: 11
I got the answer as follows. First, I made some modifications to the raw data:
X1 = c(1L, 2L, 3L, 4L, 2L, 5L)
X2 = c(2L, 3L, 4L, 5L, 3L, 6L)
X3 = c(3L, 4L, 4L, 5L, 3L, 2L)
X4 = c(2L, 4L, 6L, 5L, 3L, 8L)
X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
matrix_cor <- rbind(X1, X2, X3, X4, X5)
matrix_cor
[,1] [,2] [,3] [,4] [,5] [,6]
X1 1 2 3 4 2 5
X2 2 3 4 5 3 6
X3 3 4 4 5 3 2
X4 2 4 6 5 3 8
X5 1 3 2 4 6 4
then:
dist(matrix_cor)
X1 X2 X3 X4
X2 2.449490
X3 4.472136 4.242641
X4 5.000000 3.000000 6.403124
X5 4.358899 4.358899 4.795832 6.633250
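Note that dist() with no method argument uses Euclidean distance, so each entry above is a Euclidean distance; the X1-X2 entry, for example, can be checked by hand in base R:

```r
# Euclidean distance between X1 and X2: every element differs by 1,
# so the distance is sqrt(6), about 2.449490.
X1 <- c(1, 2, 3, 4, 2, 5)
X2 <- c(2, 3, 4, 5, 3, 6)
sqrt(sum((X1 - X2)^2))  # 2.449490
```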
Upvotes: 1
Reputation: 89057
Using outer (again :-)
my.dist <- function(x) {
  n <- nrow(x)
  d <- outer(seq.int(ncol(x)), seq.int(ncol(x)),
             Vectorize(function(i, j) sum(x[[i]] == x[[j]]) / n))
  rownames(d) <- names(x)
  colnames(d) <- names(x)
  return(d)
}
my.dist(test)
# X1 X2 X3 X4 X5
# X1 1.0000000 0.0000000 0.0 0.0 0.3333333
# X2 0.0000000 1.0000000 0.5 0.5 0.1666667
# X3 0.0000000 0.5000000 1.0 0.5 0.0000000
# X4 0.0000000 0.5000000 0.5 1.0 0.0000000
# X5 0.3333333 0.1666667 0.0 0.0 1.0000000
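A side note on the Vectorize() wrapper above: outer() calls its function once with the fully expanded index vectors, not once per element, so a scalar-valued function must be vectorized first. A minimal illustration (not part of the original answer):

```r
# f is scalar-valued: it collapses its inputs with sum(), so outer()
# would fail without the Vectorize() wrapper.
f <- function(i, j) sum((1:3)[i] == (3:1)[j])
outer(1:3, 1:3, Vectorize(f))  # 1 on the anti-diagonal, 0 elsewhere
```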
Upvotes: 6
Reputation: 69151
Here's a solution that is faster than the other two, though a bit ugly. I assume the speed gains come from not using mean(), which can be slow compared to sum(), and from computing only half of the output matrix and then filling the lower triangle manually. The function currently leaves NA on the diagonal, but you can set those to one with diag(out) <- 1 to completely match the other answers.
FUN <- function(m) {
  # compute all the combinations of column pairs
  combos <- t(combn(ncol(m), 2))
  # compute the similarity index based on the criteria defined
  sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0) / nrow(m))
  combos <- cbind(combos, sim)
  # dimensions of the output matrix
  out <- matrix(NA, ncol = ncol(m), nrow = ncol(m))
  for (i in 1:nrow(combos)) {
    # upper triangle
    out[combos[i, 1], combos[i, 2]] <- combos[i, 3]
    # lower triangle
    out[combos[i, 2], combos[i, 1]] <- combos[i, 3]
  }
  return(out)
}
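To see what FUN computes internally, the combos table and sim vector can be inspected on the question's data (a sketch; the data frame is rebuilt here so the snippet runs on its own):

```r
# The question's data frame, repeated for a self-contained example.
test <- data.frame(X1 = c(1, 2, 3, 4, 2, 5), X2 = c(2, 3, 4, 5, 3, 6),
                   X3 = c(3, 4, 4, 5, 3, 2), X4 = c(2, 4, 6, 5, 3, 8),
                   X5 = c(1, 3, 2, 4, 6, 4))
m <- as.matrix(test)
combos <- t(combn(ncol(m), 2))  # the 10 unordered column pairs
sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0) / nrow(m))
cbind(combos, sim)              # pair indices plus their match ratio
```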
I took the other two answers, made them into functions, and did some benchmarking:
library(rbenchmark)
benchmark(chase(m), flodel(m), blindJessie(m),
          replications = 1000,
          order = "elapsed",
          columns = c("test", "elapsed", "relative"))
#-----
test elapsed relative
1 chase(m) 1.217 1.000000
2 flodel(m) 1.306 1.073131
3 blindJessie(m) 17.691 14.548520
Upvotes: 2
Reputation: 4603
Here's a shot at it (dt is your data frame):
library(reshape)
df <- expand.grid(names(dt), names(dt))
df$val <- apply(df, 1, function(x) mean(dt[x[1]] == dt[x[2]]))
cast(df, Var2 ~ Var1)
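The reshape dependency is optional here: an equivalent base-R sketch with nested sapply calls (assuming dt is the question's data frame, rebuilt below) produces the same ratios as a named matrix:

```r
# The question's data frame, repeated for a self-contained example.
dt <- data.frame(X1 = c(1, 2, 3, 4, 2, 5), X2 = c(2, 3, 4, 5, 3, 6),
                 X3 = c(3, 4, 4, 5, 3, 2), X4 = c(2, 4, 6, 5, 3, 8),
                 X5 = c(1, 3, 2, 4, 6, 4))
sim <- sapply(dt, function(a) sapply(dt, function(b) mean(a == b)))
sim["X3", "X4"]  # 0.5
```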
Upvotes: 2