Reputation: 61
I am tasked with creating a distance matrix function based on a custom defined distance measure. The distance measure is as follows:
wabs_dist = function(u, v, w){
return( sum((abs(u-v))*w) )
}
Where u and v are vectors and w is a weight.
The problem I am to solve:
I am to create a distance matrix function create-dm(x,w) that returns a distance matrix for the objects in dataframe x by calling the wabs-dist(a,b,w) for all pairs of objects a and b belonging to x. If x is a data set with 4 attributes then w is a vector e.g w = c(1,1,3,2) assigned to each attribute. Yes there are already standard functions like dist() but I am to create my own here using the wabs_dist.
My solution so far:
create_dm = function(x, w){ #x is a dataframe
distances = matrix(0, nrow = nrow(x), ncol = nrow(x))
for (i in 1:nrow(x)) {
for(j in 1:(i-1)){
distances[i, j] = wabs_dist(x[i,], x[j,], w)
distances[j, i] = distances[i, j]
}
}
return(distances)
}
How do i implement a vector of weights because i wrote this function with the mindset of passing in just one weight but now i have to write it to accept a list. How do i do implement this function using the list of weights?
This function takes A LOT of time to run. In fact it never actually prints out the distance matrix function. I cant figure out why
An example:
Let x be a data frame containing vectors a, b and c where: a: (1, 2) b: (4, 5) c: (9, 12)
w is weight vector: (0.2, 0.3)
wabs-dist(a,b,w) = 1.5 wabs-dist(b,c,w) = 3.1
create-dm(x,w)=
0 1.5 4.6
1.5 0 3.1
4.6 3.1 0
Upvotes: 2
Views: 1711
Reputation: 51
I had a similar problem lately. My final solution was to write it in C++ with the Rcpp package. Save this Code as dmat.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix dmat(NumericMatrix x, NumericVector w) {
int n = x.nrow();
NumericMatrix d = no_init_matrix(n, n);
for(int i=0; i<n;i++){
for(int j=i+1; j<n;j++){
d(i,j)=sum(w*abs((x(i,_)-x(j,_))));
d(j,i)=d(i,j);
}
d(i,i)=0;
}
return d;
}
Then install and load the package "Rcpp" and use sourceCpp()
to load the function. After that you can use it like any other R function
library(Rcpp)
sourceCpp("path/to/file/dmat.cpp")
x <- matrix(rnorm(1500),ncol=3)
w <- 1:3
system.time(distR <- create_dm(x,w))
User System verstrichen
1.81 0.02 1.84
system.time(distCpp <- dmat(x,w))
User System verstrichen
0 0 0
identical(round(distR,10), round(distCpp,10))
[1] TRUE
If you just use identical()
without rounding it gives FALSE. Don't know why. Maybe this can be answered by someone else.
If you can use the euclidean distance instead of absolute distance you could use the package apcluster
. This was my first solution. But the C++ solution was still faster .
Upvotes: 1