Reputation: 71
I am trying to create a new distance function for my data. However, the performance of my code is very slow when compared to the dist function from stats package. For instance, see the results for the euclidean distance:
mydist = function (x){
euclidean = function (a, b){
sqrt(sum((a-b)^2))
}
distances = matrix(0, nrow=nrow(x), ncol=nrow(x))
for (i in 1:nrow(x))
for (j in 1:(i-1)){ # <- corrected this
if (j > 0){
distances[i,j]=euclidean(x[i,], x[j,])
distances[j,i]=distances[i,j]
}
}
distances
}
m=matrix(1:800, ncol=2)
system.time(as.dist(mydist(m)))
usuário sistema decorrido
0.714 0.000 0.716 # <- updated values with corrected version
system.time(dist(m))
usuário sistema decorrido
0.004 0.000 0.002
I will not use euclidean distance. I am developing a new one, much more complex using some statistics specific for my data, different from those of the proxy package, for instance. I have hundreds of variables and thousands of examples (lines) in the dataset. Can't wait a few hours just to compute the distance.
I have tried another code using outer with apply. It was faster than the two loops, but still very slow. Can anyone suggest anything?
Upvotes: 2
Views: 1420
Reputation: 18323
The key is to subtract each row from the entire matrix rather than each individual row. Since subtraction is done column-wise, simply transpose the matrix.
m=matrix(1:800, ncol=2)
system.time(a<-as.dist(mydist(m)))
# user system elapsed
# 1.32 0.00 1.32
t.m<-t(m)
system.time(x<-as.dist(apply(m,1,function(x) sqrt(colSums((x - t.m)^2)))))
# user system elapsed
# 0.04 0.00 0.03
any(x!=a) # FALSE
But if you really want speed, you should use a C library.
Upvotes: 2
Reputation: 14093
The key to speeding things up is
either your distance function can easily be vectorized. If that is the case, have a look at ? outer
, and/or ? rep
.
This approach can be quite fast, but also memory consuming.
apply
will reduce the two loops essentially into one, but real vectorization is usually much faster.
or you maybe want to use e.g. inline C code, see package inline.
you accidentally calculate twice as many distances as needed (you do the symmetric copying, but both i
and j
loop over the whole 1 : nrow (x)
).
Upvotes: 1