Reputation: 6713
@user3759195 wrote a post (https://stackoverflow.com/questions/24322356/rstudio-crashes-and-it-does-not-reproduce) about RStudio crashing with Rcpp, but didn't give a reproducible case. @KevinUshey mentioned in the comments that we have to PROTECT the wrap within the code.
I took the liberty of posting two alternatives to the split.data.frame function, written in Rcpp:
* VERSION THAT DOES NOT CRASH RSTUDIO *
#include <Rcpp.h>
#include <map>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
List splitDataFrameCpp(DataFrame x, NumericVector y) {
  int nRows = x.nrows();
  int nCols = x.size();
  // Collect each group's values column by column (column-major order).
  std::map<double, std::vector<double> > z;
  for (int i = 0; i < nCols; i++) {
    std::vector<double> tmp = Rcpp::as<std::vector<double> >(x[i]);
    for (int j = 0; j < nRows; j++) {
      z[y[j]].push_back(tmp[j]);
    }
  }
  std::vector<double> yunq = Rcpp::as<std::vector<double> >(sort_unique(y));
  std::map<double, DataFrame> z1;
  for (int i = 0; i < int(yunq.size()); i++) {
    NumericVector tmp1 = wrap(z[yunq[i]]);  // *** DEFINING INSIDE LOOP ***
    tmp1.attr("dim") = Dimension(int(tmp1.size()) / nCols, nCols);
    DataFrame tmp2(wrap(tmp1));             // *** DEFINING INSIDE LOOP ***
    tmp2.attr("names") = x.attr("names");
    z1[yunq[i]] = tmp2;
  }
  return wrap(z1);
}
* VERSION THAT CRASHES RSTUDIO *
#include <Rcpp.h>
#include <map>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
List splitDataFrameCpp(DataFrame x, NumericVector y) {
  int nRows = x.nrows();
  int nCols = x.size();
  // Collect each group's values column by column (column-major order).
  std::map<double, std::vector<double> > z;
  for (int i = 0; i < nCols; i++) {
    std::vector<double> tmp = Rcpp::as<std::vector<double> >(x[i]);
    for (int j = 0; j < nRows; j++) {
      z[y[j]].push_back(tmp[j]);
    }
  }
  std::vector<double> yunq = Rcpp::as<std::vector<double> >(sort_unique(y));
  std::map<double, DataFrame> z1;
  NumericVector tmp1;  // *** DEFINING OUTSIDE LOOP ***
  DataFrame tmp2;      // *** DEFINING OUTSIDE LOOP ***
  for (int i = 0; i < int(yunq.size()); i++) {
    tmp1 = wrap(z[yunq[i]]);
    tmp1.attr("dim") = Dimension(int(tmp1.size()) / nCols, nCols);
    tmp2 = wrap(tmp1);
    tmp2.attr("names") = x.attr("names");
    z1[yunq[i]] = tmp2;
  }
  return wrap(z1);
}
The main difference between the two versions is that in one case tmp1 and tmp2 are defined within the loop, and in the other case outside the loop.
Can anyone explain why the second version crashes (and what can be changed so that it does NOT crash RStudio)? I'm still a newbie to C++ and primarily write Rcpp by looking at examples on SO or the Rcpp Gallery website, so I would like to understand this behavior a little more.
Also, as a side benefit, if anyone can recommend changes to make the code faster, that would be great. The code that does NOT crash is currently around 2x-3x faster than R's split.data.frame function based on some test cases I used.
Example of test case:
> testDF
   V1 V2 V3 V4 V5 V6
1   1  5  4  1  3  2
2   2  1  5  4  1  3
3   2  2  1  5  4  1
4   3  2  2  1  5  4
5   1  3  2  2  1  5
6   4  1  3  2  2  1
7   1  5  4  1  3  2
8   2  1  5  4  1  3
9   2  2  1  5  4  1
10  3  2  2  1  5  4
11  1  3  2  2  1  5
12  4  1  3  2  2  1
> testSp<-c(1,1,1,2,2,2,3,4,4,3,3,5)
> split(testDF,testSp) OR > splitDataFrameCpp(testDF,testSp)
$`1`
  V1 V2 V3 V4 V5 V6
1  1  5  4  1  3  2
2  2  1  5  4  1  3
3  2  2  1  5  4  1
$`2`
  V1 V2 V3 V4 V5 V6
4  3  2  2  1  5  4
5  1  3  2  2  1  5
6  4  1  3  2  2  1
$`3`
   V1 V2 V3 V4 V5 V6
7   1  5  4  1  3  2
10  3  2  2  1  5  4
11  1  3  2  2  1  5
$`4`
  V1 V2 V3 V4 V5 V6
8  2  1  5  4  1  3
9  2  2  1  5  4  1
$`5`
   V1 V2 V3 V4 V5 V6
12  4  1  3  2  2  1
The microbenchmark result for this test case:
> microbenchmark(t1<-split(testDF,testSp),t2<-splitDataFrameCpp(testDF,testSp))
Unit: microseconds
                                   expr     min      lq   median       uq      max neval
             t1 <- split(testDF, test2) 343.181 365.562 372.8760 387.9430 1027.786   100
 t2 <- splitDataFrameCpp(testDF, test2) 177.881 190.315 200.5545 208.4545  870.093   100
* EDIT *
Added the sessionInfo:
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.3-0
loaded via a namespace (and not attached):
[1] Rcpp_0.11.1 tools_3.1.0
Also, testDF was created as numeric in R, not integer.
Upvotes: 3
Views: 935
Reputation: 368241
For what it is worth, here is a complete example you can just sourceCpp(). And similar to what Kevin and Romain noted, it does not blow up for me either.
#include <Rcpp.h>
using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
List splitDataFrameCppA(DataFrame x, NumericVector y) {
  int nRows = x.nrows();
  int nCols = x.size();
  std::map<double, vector<double> > z;
  for (int i = 0; i < nCols; i++) {
    std::vector<double> tmp = Rcpp::as<std::vector<double> >(x[i]);
    for (int j = 0; j < nRows; j++) {
      z[y[j]].push_back(tmp[j]);
    }
  }
  std::vector<double> yunq = Rcpp::as<std::vector<double> >(sort_unique(y));
  std::map<double, DataFrame> z1;
  for (int i = 0; i < int(yunq.size()); i++) {
    NumericVector tmp1 = wrap(z[yunq[i]]);  // *** DEFINING INSIDE LOOP ***
    tmp1.attr("dim") = Dimension(int(tmp1.size()) / nCols, nCols);
    DataFrame tmp2(wrap(tmp1));             // *** DEFINING INSIDE LOOP ***
    tmp2.attr("names") = x.attr("names");
    z1[yunq[i]] = tmp2;
  }
  return wrap(z1);
}
// [[Rcpp::export]]
List splitDataFrameCppB(DataFrame x, NumericVector y) {
  int nRows = x.nrows();
  int nCols = x.size();
  std::map<double, vector<double> > z;
  for (int i = 0; i < nCols; i++) {
    std::vector<double> tmp = Rcpp::as<std::vector<double> >(x[i]);
    for (int j = 0; j < nRows; j++) {
      z[y[j]].push_back(tmp[j]);
    }
  }
  std::vector<double> yunq = Rcpp::as<std::vector<double> >(sort_unique(y));
  std::map<double, DataFrame> z1;
  NumericVector tmp1;  // *** DEFINING OUTSIDE LOOP ***
  DataFrame tmp2;      // *** DEFINING OUTSIDE LOOP ***
  for (int i = 0; i < int(yunq.size()); i++) {
    tmp1 = wrap(z[yunq[i]]);
    tmp1.attr("dim") = Dimension(int(tmp1.size()) / nCols, nCols);
    tmp2 = wrap(tmp1);
    tmp2.attr("names") = x.attr("names");
    z1[yunq[i]] = tmp2;
  }
  return wrap(z1);
}
/*** R
testDF <- read.table(textConnection("
1 5 4 1 3 2
2 1 5 4 1 3
2 2 1 5 4 1
3 2 2 1 5 4
1 3 2 2 1 5
4 1 3 2 2 1
1 5 4 1 3 2
2 1 5 4 1 3
2 2 1 5 4 1
3 2 2 1 5 4
1 3 2 2 1 5
4 1 3 2 2 1
"))
testSp <- c(1,1,1,2,2,2,3,4,4,3,3,5)
str(splitDataFrameCppA(testDF, testSp))
str(splitDataFrameCppB(testDF, testSp))
library(microbenchmark)
microbenchmark(split(testDF, testSp),
               splitDataFrameCppA(testDF, testSp),
               splitDataFrameCppB(testDF, testSp))
*/
The benchmark is about even between your two versions:
R> library(microbenchmark)
R> microbenchmark(split(testDF,testSp),
+ splitDataFrameCppA(testDF,testSp),
+ splitDataFrameCppB(testDF,testSp))
Unit: microseconds
                               expr     min      lq  median      uq      max neval
              split(testDF, testSp) 687.271 724.748 745.287 791.574 2373.283   100
 splitDataFrameCppA(testDF, testSp) 380.781 393.161 406.686 421.469  491.803   100
 splitDataFrameCppB(testDF, testSp) 377.959 393.391 405.476 429.947 2052.193   100
R>
R>
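As an aside, here is a minimal plain C++ sketch (no Rcpp, so it compiles on its own) of the grouping step that both versions share. It may help show why setting "dim" to (size / nCols, nCols) recovers each group's rows. The helper name groupColumnMajor and the vector-of-columns input are assumptions made for illustration only; they are not part of Rcpp or of the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper for illustration, not part of Rcpp.
// columns[c][r] is the value in row r of column c; keys[r] is the group of row r.
// Mirrors the z[y[j]].push_back(tmp[j]) loop: columns are visited one at a time,
// so each group's vector ends up in column-major order.
std::map<double, std::vector<double> >
groupColumnMajor(const std::vector<std::vector<double> >& columns,
                 const std::vector<double>& keys) {
  std::map<double, std::vector<double> > groups;
  for (std::size_t c = 0; c < columns.size(); ++c) {
    // Every column must supply one value per row.
    assert(columns[c].size() == keys.size());
    for (std::size_t r = 0; r < keys.size(); ++r) {
      groups[keys[r]].push_back(columns[c][r]);
    }
  }
  return groups;
}
```

With columns {10, 20, 30} and {11, 21, 31} and keys {1, 2, 1}, group 1 collects 10, 30, 11, 31. Read column-major as a 2 x 2 matrix, that is the rows (10, 11) and (30, 31), which are exactly the two input rows with key 1. This is only a sketch of the data layout; it says nothing about the crash itself.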
Upvotes: 4