Reputation: 89
Today I was trying to debug my code and stumbled across something that renders my solutions useless. What i am generally trying to calculate is the multidimensional L2-Norm for the following two matrices. As long as I am not using scale() everything is working fine. Nonetheless, as soon as I scale the matrices the solutions of the three used approaches are not the same anymore. What am I missing here?
set.seed(655)
df.a <- data.frame(A = sample(100:124, 24), B = sample(1:24, 24), C = sample(1:24, 24), D = rep(0, times=24))
df.b <- data.frame(A = sample(125:148, 24), B = sample(25:48, 24), C = sample(1:24, 24), D = sample(1:100, 24))
For this reason I have three different approaches:
sapply-function and sqrt of rowSums
sse <- function(x1, x2) sum((x1 - x2) ^ 2)
distanceChangeByTech <- function(x) {
sse(df.a[,x], df.b[,x])
}
help1 <- t(data.frame(sapply(colnames(df.a), distanceChangeByTech)))
dist_sap <- sqrt(rowSums(help1))
multidimensional Euclidean distance using RCPP:
multiEucl <- cxxfunction(signature(x="matrix", y="matrix"), plugin="Rcpp",
body='
Rcpp::NumericMatrix dx(x);
Rcpp::NumericMatrix dy(y);
const int N = dx.nrow();
const int M = dx.ncol();
double sum = 0;
for(int i=0; i<N; i++){
for(int j=0; j<M; j++){
sum = sum + pow(dx(i,j) - dy(i,j), 2);
}
}
return wrap(sqrt(sum));
')
multidimensional Lp-Norm using RCPP:
multiPNorm <- cxxfunction(signature(x="matrix", y="matrix", p="numeric"), plugin="Rcpp",
body='
Rcpp::NumericMatrix dx(x);
Rcpp::NumericMatrix dy(y);
double dp = Rcpp::as<double>(p);
const int N = dx.nrow();
const int M = dx.ncol();
double sum = 0;
double rsum = 0;
for(int i=0; i<N; i++){
for(int j=0; j<M; j++){
sum = sum + pow(abs(dx(i,j) - dy(i,j)), dp);
}
}
rsum = pow(sum, 1/dp);
return wrap(rsum);
')
When I tried this at first all worked well.
> multiEucl(as.matrix(df.a), as.matrix(df.b))
[1] 366.1543
> multiPNorm(as.matrix(df.a), as.matrix(df.b), 2)
[1] 366.1543
> sqrt(rowSums(help1)) sapply.colnames.df.a...distanceChangeByTech.
366.1543
But as soon as I scale the matrices, which I want to do because I will do a Clustering based on these distancemeasures, there is a fault. The solutions are not the same anymore?! What is causing this? I am using these commands to scale.
df.a <- as.data.frame(scale(df.a))
df.a[is.na(df.a)] <- 0
df.b <- as.data.frame(scale(df.b))
df.b[is.na(df.b)] <- 0
> multiEucl(as.matrix(df.a), as.matrix(df.b))
[1] 12.51781
> multiPNorm(as.matrix(df.a), as.matrix(df.b), 2)
[1] 8.944272
> sqrt(rowSums(help1))
sapply.colnames.df.a...distanceChangeByTech.
12.51781
Upvotes: 0
Views: 127
Reputation: 368509
You used abs()
which is documented eg here but you meant to use fabs()
which is documented here.
The cmath.h
header provides overloaded abs()
as well, but you probably didn't include that.
Upvotes: 2
Reputation: 89
It seems that abs() is not doing the right thing here. Instead I changed my coding of the multiPNorm and the changes seem to work.
multiPNorm <- cxxfunction(signature(x="matrix", y="matrix", p="numeric"), plugin="Rcpp",
body='
Rcpp::NumericMatrix dx(x);
Rcpp::NumericMatrix dy(y);
double dp = Rcpp::as<double>(p);
const int N = dx.nrow();
const int M = dx.ncol();
double sum = 0;
double rsum = 0;
double help = 0;
for(int i=0; i<N; i++){
for(int j=0; j<M; j++){
help = dx(i,j) - dy(i,j);
if (help < 0) {
help = - help;
}
sum = sum + pow(help, dp);
}
}
rsum = pow(sum, 1/dp);
return wrap(rsum);
')
Upvotes: 0