aromatic6tet

Rcpp Function filling matrix with different values

I'm building a process that instantiates a NumericMatrix and fills it with Sørensen-Dice similarity coefficients, i.e. a similarity matrix. The matrix dimensions vary with the number of elements being processed; generally more than 100 elements are compared at a time, so the matrix is typically 100+ by 100+.

What I've built so far creates the matrix, calculates the coefficients, and fills the matrix with them. However, when I run the function repeatedly, values within the matrix change between runs. That is not expected, since the data being compared is neither changed nor re-sorted between runs. I also get similarities greater than 1, which should definitely not happen.

I have four functions: one finds the numerator of the coefficient, one finds the denominator, one combines the two to calculate the coefficient, and the fourth puts the coefficients in the matrix.
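For reference, the coefficient these functions combine to is 2 × matches / (size1 + size2). The following is a minimal sketch of just that arithmetic in plain C++, with no Rcpp types; `diceFromCounts` is a hypothetical helper name and is not one of the four functions in this post.

```cpp
#include <limits>

// Sørensen-Dice coefficient from a match count and the two vector sizes.
// Hypothetical helper for illustration only.
double diceFromCounts(int matches, int size1, int size2) {
    const int total = size1 + size2;
    if (total == 0) {
        // Both inputs are empty: the ratio is undefined, so report NaN.
        return std::numeric_limits<double>::quiet_NaN();
    }
    // Multiply by 2.0 so the division is carried out in floating point.
    return (2.0 * matches) / total;
}
```

Under this formula a result above 1 can only occur when the match count exceeds (size1 + size2) / 2, so a similarity greater than 1 points at the match count rather than at the division.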

Here's the C++ code:

#include <Rcpp.h>
using namespace Rcpp;

// function to calculate the denominator of the dice coefficient
int diceDenomcpp(NumericVector val1, NumericVector val2){

  // count the non-NA values in each vector
  int val1Len = na_omit(val1).size();
  int val2Len = na_omit(val2).size();

  // return the total number of data points within both vectors
  return val1Len + val2Len;
}

//######################################################################
//######################################################################
//######################################################################

// function to calculate the numerator for the dice coefficient
int diceNumcpp(NumericVector iso1, NumericVector iso2){
  
  // declare and initialize vectors with the element band data
  // remove any NA values within each vector
  NumericVector is1 = na_omit(iso1);
  NumericVector is2 = na_omit(iso2);
  
  // declare and initialize some counter variables
  int n = 0;
  int m = 0;
  int match = 0;
  
  // loop through the first element's first datum and check for matching datum
  // with the second element then continue to loop through each datum within each element 
  while (n<=is1.size()){
    if (m>=is2.size()){
      n++;
      m=0;
    }
    // if a suitable match is found, increment the match variable
    if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
      match++;
      
    }
    m++;
  }
  return match;
}

//########################################################################
//########################################################################
//########################################################################

// function to put the coefficient together
double diceCoefcpp(NumericVector val1, NumericVector val2){
  
  NumericVector is1 = clone(val1);
  NumericVector is2 = clone(val2);
  double dVal;
  double num = 2*diceNumcpp(is1, is2);
  double denom = diceDenomcpp(is1, is2);
  
  dVal = num/denom;
  
  return dVal;
  
}

//#######################################################################
//#######################################################################
//#######################################################################


// function to build the similarity matrix with the coefficients

NumericMatrix simMatGencpp(NumericMatrix df){
  
  // clone the input matrix
  NumericMatrix rapdDat = clone(df);

  // create the output matrix and fill it with NA
  NumericMatrix simMat(rapdDat.nrow(),rapdDat.nrow());
  std::fill(simMat.begin(), simMat.end(), NumericVector::get_na());
  
  // declare and initialize the iterator
  int i = 0;

  // declare and initialize the column counter
  int col = 0;  
  
  // declare and initialize the isolate counter
  int iso = 0;
  
  //simMat(_,0)=rapdDat(_,0);
  
  while (iso < rapdDat.nrow()){
    if (iso+i > rapdDat.nrow()){
      col++;
      i=0;
      iso++;
    }
    if (iso+i < rapdDat.nrow()){
      simMat(iso+i, col) = diceCoefcpp(rapdDat(iso,_), rapdDat(iso+i,_));
      
    }
    i++;
  }
  
  
  //Rcout << "SimMatrix:" << simMat << "\n";
  
  return simMat;
}

Here's a sample of what the input data looks like:

sampleData

    band1  band2  band3  band4  band5  band6
1   593.05 578.04 439.01     NA     NA     NA
2   589.07 567.03     NA     NA     NA     NA
3   591.04 575.10 438.12     NA     NA     NA
4   591.04     NA     NA     NA     NA     NA
5   588.08 573.18     NA     NA     NA     NA
6   591.04 576.09 552.10     NA     NA     NA
7  1805.00 949.00 639.19 589.07 576.09 440.06
8   952.00 588.08 574.14 550.04     NA     NA
9  1718.00 576.09 425.01     NA     NA     NA
10 1708.00 577.05 425.01     NA     NA     NA

With a small enough data set, simMatGencpp() produces the same results each time; with larger data sets, the values start to change from run to run.

I've tried running diceNumcpp(), diceDenomcpp(), and diceCoefcpp() independently on individual elements and got the expected output consistently each time. Once I use simMatGencpp(), however, the output becomes inconsistent again. So I tried calling each individual function in a loop, like below.

Example:

for(i in 1:100){
  print(diceNumcpp(sampleData[7,], sampleData[3,]))
}

The expected output from above is 3, but sometimes it's 4. Which iteration returns 4 varies from run to run: sometimes the second, sometimes the 14th, sometimes none at all, sometimes three in a row.

My first thought was that, since C++ has no garbage collection, the previous function call might be leaving the old vector in memory, because the name of the output object doesn't change from run to run. But then this post says that when the function exits, any object created within the scope of the function call is destroyed as well.

When I code the same solution in R-code only, the runtime sucks, but it will consistently return a matrix or the example vector with the same values each time.

I'm at a loss. Any help or light anyone could shed on this would be greatly appreciated!

Thanks for your help.

Update 2020-08-19

I'm hoping this provides some insight for the more well-versed C++ people out there, so that you may have additional ideas about what is happening. I have some sample data, similar to what is shown above, that is 187 rows long, meaning the similarity matrix has 17578 filled (non-NA) elements: the lower triangle including the diagonal, 187 × 188 / 2. I've been comparing the R version of this solution with the C++ version, using code like this on the sample data:

# create the similarity matrix with the R-solution to compare iteratively
# with another R-solution similarity matrix
simMat1 <- simMatGen(isoMat)
resultsR <- c()
for(i in 1:100){
  
  simMat2 <- simMatGen(isoMat)

  # check for any mis-matched elements in each matrix
  resultsR[[i]]<-length(which(simMat1 == simMat2)==TRUE)

  #######################################################################
  # every time this runs I get the expected number of TRUE values, 17578,
  # and I check this by subtracting mean(resultsR) from the expected
  # number of TRUE values of 17578
}

mean(resultsR)

Now when I do the same process with the C++ version, things change drastically and quickly. I tried this with both 64-bit and 32-bit R 3.6.0, just because.

simMat1 <- simMatGen(isoMat)
isoMat <- as.matrix(isoMat)
resultscpp <- c()
for(i in 1:10000){
  
  simMat2 <- simMatGencpp(isoMat)
  resultscpp[[i]]<-length(which(simMat1 == simMat2)==TRUE)

  ############  64 bit R  ##############
  # first iteration length(which(simMat1 == simMat2)==TRUE)-17578 equals 2
  # second iteration 740 elements differ: length(which(simMat1 == simMat2)==TRUE)-17578 equals 740 
  # third iteration 1142 elements differ
  # after 100 iterations the average difference is 2487.7 elements
  # after 10000 iterations the average difference is 2625.91 elements
  
  ############  32 bit R  ##############
  # first iteration difference = 1
  # second iteration difference = 694
  # 100 iterations difference = 2520.94
  # 10000 iterations difference = 2665.04
}

mean(resultscpp)

Here's sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        rstudioapi_0.10   magrittr_1.5      usethis_1.5.0     devtools_2.1.0    pkgload_1.0.2     R6_2.4.0          rlang_0.4.4      
 [9] tools_3.6.0       pkgbuild_1.0.3    sessioninfo_1.1.1 cli_1.1.0         withr_2.1.2       remotes_2.1.0     assertthat_0.2.1  digest_0.6.20    
[17] rprojroot_1.3-2   crayon_1.3.4      processx_3.3.1    callr_3.2.0       fs_1.3.1          ps_1.3.0          testthat_2.3.1    memoise_1.1.0    
[25] glue_1.3.1        compiler_3.6.0    desc_1.2.0        backports_1.1.5   prettyunits_1.0.2

Answers (1)

aromatic6tet

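I made a rookie C++ mistake here.

In diceNumcpp() I didn't put any checks in place to keep the loop from referencing an out-of-bounds element of the vectors. The while loop lets n reach is1.size(), so is1[n] reads whatever happens to sit in memory past the end of the vector, and that can differ from call to call.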
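To see which index pairs the original loop actually touches, the sketch below replays only its index arithmetic in plain C++, without reading any vector. `outOfRangePairs` is a hypothetical name, and `size1` and `size2` stand in for `is1.size()` and `is2.size()`.

```cpp
#include <utility>
#include <vector>

// Replays the control flow of the original while loop and returns every
// (n, m) pair at which is1[n] or is2[m] would be read out of range.
std::vector<std::pair<long, long>> outOfRangePairs(long size1, long size2) {
    std::vector<std::pair<long, long>> bad;
    long n = 0;
    long m = 0;
    while (n <= size1) {  // same condition as the original loop
        if (m >= size2) {
            n++;
            m = 0;
        }
        // The original code reads is1[n] and is2[m] at this point.
        if (n >= size1 || m >= size2) {
            bad.emplace_back(n, m);
        }
        m++;
    }
    return bad;
}
```

For sizes 6 and 3 (rows 7 and 3 of the sample data once the NA values are dropped) it reports four pairs: (6, 0), (6, 1), (6, 2) and (7, 0). That would be consistent with the count occasionally coming back as 4 instead of 3, whenever the value past the end happens to land within 1% of a real band. The comparison that performs those reads is: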

    // if a suitable match is found, increment the match variable
    if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
      match++;
    }

was changed to:

    // if a suitable match is found, increment the match variable
    if(n < is1.size() && m < is2.size()){ // <- make sure both indices stay in bounds
      if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
        match++;
      }
    }

After running it 1000 times, I got correct results every time.

You learn something new every day.
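An alternative that makes the bounds explicit is to replace the single while loop with two nested for loops, so neither index can leave its range. This is a minimal sketch in plain C++, not the code I actually ran: `countMatches` and the `tol` parameter are hypothetical names, and `std::vector<double>` stands in for a NumericVector that has already had its NA values removed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Counts the (a, b) pairs whose absolute difference is below `tol`
// relative to both a and b, the same test used in diceNumcpp().
int countMatches(const std::vector<double>& is1,
                 const std::vector<double>& is2,
                 double tol = 0.01) {
    int match = 0;
    for (std::size_t n = 0; n < is1.size(); ++n) {
        for (std::size_t m = 0; m < is2.size(); ++m) {
            const double diff = std::fabs(is1[n] - is2[m]);
            if (diff / is1[n] < tol && diff / is2[m] < tol) {
                ++match;
            }
        }
    }
    return match;
}
```

Called with rows 7 and 3 of the sample data (NA values removed), this returns 3, the expected value from the question.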

Cheers.
