user1348438
user1348438

Reputation: 71

Optimize my own distance function in R

I am trying to create a new distance function for my data. However, the performance of my code is very slow when compared to the dist function from stats package. For instance, see the results for the euclidean distance:

mydist = function (x){

  euclidean = function (a, b){
    sqrt(sum((a-b)^2))
  }

  distances = matrix(0, nrow=nrow(x), ncol=nrow(x))
  for (i in 1:nrow(x))
    for (j in 1:(i-1)){  # <- corrected this
      if (j > 0){
         distances[i,j]=euclidean(x[i,], x[j,])
         distances[j,i]=distances[i,j]
      }
    }

  distances
}


m=matrix(1:800, ncol=2)
system.time(as.dist(mydist(m)))
  usuário   sistema decorrido 
    0.714     0.000     0.716      # <- updated values with corrected version

system.time(dist(m))
  usuário   sistema decorrido 
    0.004     0.000     0.002 

I will not use euclidean distance. I am developing a new one, much more complex using some statistics specific for my data, different from those of the proxy package, for instance. I have hundreds of variables and thousands of examples (lines) in the dataset. Can't wait a few hours just to compute the distance.

I have tried another code using outer with apply. It was faster than the two loops, but still very slow. Can anyone suggest anything?

Upvotes: 2

Views: 1420

Answers (2)

nograpes
nograpes

Reputation: 18323

The key is to subtract each row from the entire matrix rather than each individual row. Since subtraction is done column-wise, simply transpose the matrix.

m=matrix(1:800, ncol=2)
system.time(a<-as.dist(mydist(m)))
# user  system elapsed
# 1.32    0.00    1.32 

t.m<-t(m)
system.time(x<-as.dist(apply(m,1,function(x) sqrt(colSums((x - t.m)^2)))))
# user  system elapsed
# 0.04    0.00    0.03 

any(x!=a) # FALSE

But if you really want speed, you should use a C library.

Upvotes: 2

cbeleites
cbeleites

Reputation: 14093

The key to speeding things up is

  • either your distance function can easily be vectorized. If that is the case, have a look at ? outer, and/or ? rep.
    This approach can be quite fast, but also memory consuming.

  • apply will reduce the two loops essentially into one, but real vectorization is usually much faster.

  • or you maybe want to use e.g. inline C code, see package inline.

  • you accidentally calculate twice as many distances as needed (you do the symmetric copying, but both i and j loop over the whole 1 : nrow (x)).

Upvotes: 1

Related Questions