Reputation: 3116
I am trying to calculate Gower's similarity between a set of items. Because the daisy function throws an error on larger data, I am writing my own function with the Rcpp package to calculate the similarity values.
The function is:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List gowerSim(CharacterMatrix inp) {
int n_row = inp.nrow(), n_col = inp.ncol();
int sumRow = 0, colLen;
List out(n_row);
//double sim[n_row];
NumericVector sim(n_row);
for (int i = 0; i < n_row; i++) {
for (int j = 0; j < n_row; j++) {
sumRow = 0;
colLen = n_col;
for (int k = 0; k < n_col; k++) {
if (inp(i,k) != "NA" && inp(j,k) != "NA") {
if (inp(i,k) != inp(j,k)) {
sumRow = sumRow + 1;
}
} else {
colLen = colLen - 1;
}
}
if (colLen > 0) {
sim[j] = (double) sumRow/colLen;
//printf("%f",sim[j]);
} else {
sim[j] = NA_INTEGER;
}
}
out[i] = sim;
if (i < 3) {
print(out);
}
}
return out;
}
/*** R
clust<-gowerSim(inp)
*/
The returned list has the last vector copied into all the other elements. For example, if clust has length 250, clust[[1]] and clust[[250]] hold exactly the same values. However, when printed inside the loop (for the first 3 elements), the vectors out[1], out[2], and out[3] are different.
Can anybody please tell me what the issue is here?
Upvotes: 1
Views: 1491
Reputation: 10352
The solution to this problem is to define the vector sim inside the first for loop, like this:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List gowerSim(CharacterMatrix inp) {
int n_row = inp.nrow(), n_col = inp.ncol();
int sumRow=0,colLen;
List out(n_row);
for(int i=0;i<n_row;i++){
NumericVector sim(n_row);
for(int j=0;j<n_row;j++){
sumRow=0;
colLen=n_col;
for(int k=0; k<n_col;k++){
if(inp(i,k)!="NA" && inp(j,k)!="NA"){
if(inp(i,k)!=inp(j,k)){
sumRow=sumRow+1;
}
}else{
colLen=colLen-1;
}
}
if(colLen>0){
sim[j] = (double) sumRow/colLen;
//printf("%f",sim[j]);
}else{
sim[j] = NA_REAL; // NA_INTEGER would be stored as -2147483648 in a NumericVector
}
}
out[i] = sim;
if(i<3){
print(out);
}
}
return out;
}
A little example:
mat <- matrix( as.character(c(rep(1,5),sample(3,15,repl=TRUE),rep(5,5))),5)
clust <- gowerSim(mat)
clust
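Alternatively, you can keep the declaration where you had it and assign a fresh vector to sim at the start of each iteration of the first for loop.
Why exactly this approach works and yours does not, I don't really know, but I think it comes down to how the list stores vectors: out[i] = sim seems to keep a reference to the same underlying vector rather than a copy, so every element ends up showing the values written last.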
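To illustrate the suspected cause, here is a minimal sketch in standard C++ rather than Rcpp. It is only an analogy under an assumption: std::shared_ptr stands in for a vector handle that shares its storage when copied, and the names Handle, at, fillShared and fillFresh are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a vector handle: copying it shares the storage.
using Handle = std::shared_ptr<std::vector<double>>;

// Checked element access through a handle.
double at(const Handle& h, std::size_t j) {
    assert(h && j < h->size());
    return (*h)[j];
}

// Mirrors the question: one buffer created before the loop, stored n times.
std::vector<Handle> fillShared(std::size_t n) {
    std::vector<Handle> out(n);
    Handle sim = std::make_shared<std::vector<double>>(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            (*sim)[j] = static_cast<double>(i * n + j);
        }
        out[i] = sim;  // copies the handle, not the data
    }
    return out;
}

// Mirrors the fix: a new buffer is created in every outer iteration.
std::vector<Handle> fillFresh(std::size_t n) {
    std::vector<Handle> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        Handle sim = std::make_shared<std::vector<double>>(n);
        for (std::size_t j = 0; j < n; ++j) {
            (*sim)[j] = static_cast<double>(i * n + j);
        }
        out[i] = sim;  // each element now owns its own data
    }
    return out;
}
```

With fillShared every element of the result points at the same buffer, so all of them show the values from the last iteration. With fillFresh each element keeps the values written in its own iteration.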
My first approach to solving your problem was different: instead of filling a list, we fill a matrix, and this works fine:
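#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix gowerSim(CharacterMatrix inp) {
int n_row = inp.nrow(), n_col = inp.ncol();
int sumRow=0,colLen;
// One column per row of inp, so the result is n_row x n_row.
NumericMatrix out(n_row, n_row);
NumericVector sim(n_row);
for(int i=0;i<n_row;i++){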
for(int j=0;j<n_row;j++){
sumRow=0;
colLen=n_col;
for(int k=0; k<n_col;k++){
if(inp(i,k)!="NA" && inp(j,k)!="NA"){
if(inp(i,k)!=inp(j,k)){
sumRow=sumRow+1;
}
}else{
colLen=colLen-1;
}
}
if(colLen>0){
sim[j] = (double) sumRow/colLen;
//printf("%f",sim[j]);
}else{
sim[j] = NA_REAL; // NA_INTEGER would be stored as -2147483648 in a NumericVector
}
}
out(_,i) = sim;
if(i<3){
print(out);
}
}
return out;
}
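For checking the numbers outside of R, the same pairwise calculation can be sketched in standard C++ without Rcpp. This is an illustration under assumptions, not a replacement for the function above: mismatchFraction, Table and Matrix are hypothetical names, the literal string "NA" marks a missing value as in the question, and NaN is used where R would hold NA.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

using Table = std::vector<std::vector<std::string>>;
using Matrix = std::vector<std::vector<double>>;

// For every pair of rows, the fraction of columns whose values differ,
// skipping any column where either value is the literal string "NA".
// Pairs with no comparable column get NaN.
Matrix mismatchFraction(const Table& inp) {
    const std::size_t nRow = inp.size();
    const std::size_t nCol = nRow ? inp[0].size() : 0;
    const double nan = std::numeric_limits<double>::quiet_NaN();
    Matrix out(nRow, std::vector<double>(nRow, nan));
    for (std::size_t i = 0; i < nRow; ++i) {
        assert(inp[i].size() == nCol);  // the input must be rectangular
        for (std::size_t j = 0; j < nRow; ++j) {
            int mismatches = 0;
            int compared = 0;
            for (std::size_t k = 0; k < nCol; ++k) {
                if (inp[i][k] == "NA" || inp[j][k] == "NA") continue;
                ++compared;
                if (inp[i][k] != inp[j][k]) ++mismatches;
            }
            if (compared > 0) {
                out[i][j] = static_cast<double>(mismatches) / compared;
            }
        }
    }
    return out;
}
```

The result has one row and one column per input row. Note that, like the code in the question, this counts mismatches, so larger values mean less similar rows.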
Upvotes: 3