Yayahii

Reputation: 43

Perl Regular Expression running faster than C++ Boost Implementation

I am confused about what is happening here. Most benchmarks I have seen show Boost performing close to Perl or even beating it. In my scripts, however, the Perl implementation is roughly 5-6 times faster.

In both test_script.cpp and test_script.pl I open a file and read it line by line, populating an array. I then run each string against a list of regex definitions in linear order until one matches, at which point nothing happens (I/O was removed for testing purposes) and the next string is compared, and so on until all strings have been compared.

test_script.pl:

#make incomingList, which contains all incoming strings
my $start = Time::HiRes::gettimeofday();

foreach (@incomingList) {
  my $inString = $_;
  &find_pattern($inString);
}

my $end = Time::HiRes::gettimeofday();
printf("%.6f\n", $end - $start);

find_pattern method:

sub find_pattern {
  my $URLString = $_[0];

  #1 rewrite
  if($URLString =~ m/^\/stuff\/brands-([^\/]*)\/(.*)?$/) {

  }
  #2 rewrite
  elsif($URLString =~ m/^\/coupons(\/.*)?$/){

  }
  #3 rewrite
  elsif($URLString =~ m/^\/han\/(.+)$/){

  }
  # ...continues on, there are 100 patterns. 
}

test_script.cpp, main method:

populateArray();
//make stringArr, which contains all incoming strings
struct timeval time;
gettimeofday(&time, NULL);
double t1=time.tv_sec+(time.tv_usec/1000000.0);   

for(int j =0; j < 10000; j++){
  getRule(stringArr[j]);
 }

gettimeofday(&time, NULL);
double t2=time.tv_sec+(time.tv_usec/1000000.0);
printf("%.6lf seconds elapsed\n", t2-t1);

populateArray method:

static void populateArray(){
regexArray[1] =  boost::regex ("\\/stuff\\/brands-([^\\/]*)\\/(.*)?");
regexArray[2] =  boost::regex ("\\/coupons(\\/.*)?");
regexArray[3] =  boost::regex ("\\/han\\/(.+)"); 
//continues on, 100 definitions. 
}

getRule method:

static void getRule(string inQuery){
  for(int i =1; i < 100; i++){
    if(boost::regex_match(inQuery, regexArray[i])){
      break;
    }
  }
}

I understand that it might seem a little odd to use a linear chain of if/elsif checks in Perl, but that is because I have to reformat each rule independently later. Regardless, unless I am misunderstanding something, the two scripts are very similar: each walks down the list of regex definitions until it finds a match, then moves on to the next incoming string.

So why are the results so different? For 100 rules (the same ones in both scripts) and 10,000 inputs, the .cpp averages around 0.155 seconds and the .pl averages around 0.028 seconds. Edit: With compiler optimization enabled, the C++ version runs in roughly 0.091 seconds, which is still slower.

Any insight is appreciated.

Upvotes: 1

Views: 190

Answers (1)

T33C

Reputation: 4429

In addition to turning on the compiler's optimisation settings, try passing the boost::regex_constants::optimize flag, which directs the regex library to construct the most efficient state machine it can.

static void populateArray(){
  regexArray[1] =  boost::regex ("\\/stuff\\/brands-([^\\/]*)\\/(.*)?", boost::regex_constants::optimize);
  //continues on, 100 definitions.
}

Also, be sure to pass the string to getRule by const reference rather than by value. Passing by value copies the string on every call, and that copy can involve a heap allocation.

If you can make sure the compiler inlines the function, that would be best.
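
As a rough illustration of the suggestions so far, here is a minimal sketch that compiles the patterns once with the optimize flag and takes the query by const reference. It is not the asker's code: it assumes std::regex as a stand-in for boost::regex so that it builds without Boost, and build_rules and first_matching_rule are hypothetical names. Only the three patterns shown in the question are included.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Compile every pattern once, up front, with the optimize flag.
// Assumption: std::regex stands in for boost::regex in this sketch.
std::vector<std::regex> build_rules() {
    const auto flags = std::regex::ECMAScript | std::regex::optimize;
    std::vector<std::regex> rules;
    rules.emplace_back("/stuff/brands-([^/]*)/(.*)?", flags);
    rules.emplace_back("/coupons(/.*)?", flags);
    rules.emplace_back("/han/(.+)", flags);
    return rules;
}

// Returns the zero-based index of the first rule that matches the whole
// query, or -1 if none does. The query is taken by const reference, so
// calling this with an existing std::string does not copy it.
int first_matching_rule(const std::string& query,
                        const std::vector<std::regex>& rules) {
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (std::regex_match(query, rules[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Whether the optimize flag makes a measurable difference depends on the library and its version, so it is worth timing both variants rather than assuming a gain.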

Also, as Oals commented above, the C++ regexes do not use the beginning-of-line and end-of-line anchors (^...$) that the Perl ones do.
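
For context on the anchors, a whole-sequence match and a search treat an unanchored pattern differently. The sketch below uses std::regex as a stand-in for boost::regex (an assumption made so that it compiles without Boost), and matches_whole and matches_anywhere are hypothetical helper names. regex_match succeeds only if the pattern covers the entire input, so it behaves as if anchored, whereas regex_search needs explicit ^ and $ to get the same effect.

```cpp
#include <regex>
#include <string>

// True only if the pattern matches the entire input, as if it were
// wrapped in ^...$.
bool matches_whole(const std::string& input, const std::string& pattern) {
    return std::regex_match(input, std::regex(pattern));
}

// True if the pattern matches any substring of the input; the pattern
// must carry its own ^ and $ to restrict where it may match.
bool matches_anywhere(const std::string& input, const std::string& pattern) {
    return std::regex_search(input, std::regex(pattern));
}
```

With these helpers, the pattern /han/(.+) matches the input /x/han/solo under matches_anywhere but not under matches_whole. The question's code calls regex_match, so it is worth checking whether the missing anchors actually change which inputs match before adding them.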

Upvotes: 3
