Shray

Reputation: 181

C++ regex_search vs regex matching in perl

I have a log file with about 400k lines. I found that my C++ code is very slow compared to the equivalent Perl code, so I wrote a simple loop over the log file that applies the same regex in both languages. The Perl script runs very fast, while the C++ version takes much longer.

In C++ I use the <regex> library (#include <regex>), whereas in Perl regexes are built into the language. How can I make the C++ code as efficient as the Perl code, given that Perl itself is implemented in C?

#include <fstream>
#include <regex>
#include <string>
using namespace std;

// logFileHandle is assumed to be an already-opened ifstream for the log file.
regex log_line("(\\d{1,2}\\/[A-Za-z]{3}\\/\\d{1,4}):(\\d{1,2}:\\d{1,2}:\\d{1,2}).*?\".*?[\\s]+(.*?)[\\s\?].*?\"[\\s]+([\\d]{3})[\\s]+(\\d+)[\\s]+\"(.*?)\"[\\s]+\"(.*?)\"[\\s]+(\\d+)");
string line;
int count = 0;
smatch match;
while (getline(logFileHandle, line)) {
    if (regex_search(line, match, log_line)) {
        count++;
    }
}


open(LOG_FILE, "<$log_file_location");
my $count = 0;
while ($thisLine = <LOG_FILE>) {
    # A successful match returns all eight capture groups.
    if ((($datePart, $time, $requestUrl, $status, $bytesDelivered, $httpReferer, $httpUserAgent, $requestReceived) = $thisLine =~ /(\d{1,2}\/[A-Za-z]{3}\/\d{1,4}):(\d{1,2}:\d{1,2}:\d{1,2}).*?\".*?[\s]+(.*?)[\s\?].*?\"[\s]+([\d]{3})[\s]+(\d+)[\s]+\"(.*?)\"[\s]+\"(.*?)\"[\s]+(\d+)/o) == 8) {
        $count++;
    }
}

If my question is not in the right format or something is missing, please let me know. Thanks.

EDIT 1: I used the chrono library in C++ to measure the time taken; the output is below. To keep things simple I used a sample of the log file. Just reading the file and counting the lines takes 57 ms. With regex_search added, the same sample takes 2462 ms.

No of Lines27399
With regex + logfileRead
Time taken by function: 2462 milliseconds
No of Lines27399
With just simple logfileRead
Time taken by function: 57 milliseconds
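
For reference, a measurement like the one above could be structured roughly as follows. This is only a minimal sketch and not the code that produced those numbers: `PassResult` and `timed_pass` are hypothetical names, and it assumes the input is any `std::istream` (for example an opened `std::ifstream`). Passing a null pattern times the plain read; passing a compiled `std::regex` times read plus `regex_search`.

```cpp
#include <chrono>
#include <istream>
#include <regex>
#include <string>

// Result of one timed pass over the input (hypothetical helper type).
struct PassResult {
    long long lines;         // number of lines read
    long long matches;       // lines on which the pattern matched (0 if no pattern)
    long long milliseconds;  // wall-clock duration of the pass
};

// Reads every line from `in`. If `pattern` is non-null, regex_search is
// also run on each line, so the two costs can be compared.
PassResult timed_pass(std::istream& in, const std::regex* pattern) {
    using clock = std::chrono::steady_clock;
    PassResult result{0, 0, 0};
    std::string line;
    std::smatch match;

    const auto start = clock::now();
    while (std::getline(in, line)) {
        ++result.lines;
        if (pattern && std::regex_search(line, match, *pattern)) {
            ++result.matches;
        }
    }
    const auto stop = clock::now();

    result.milliseconds =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    return result;
}
```

Calling `timed_pass(file, nullptr)` and then, on a freshly opened stream, `timed_pass(file, &log_line)` gives the two figures to compare. The regex object should be constructed once, outside the timed loop, so that compilation is not counted on every line.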

Upvotes: 1

Views: 1461

Answers (1)

Kelvin Sherlock

Reputation: 853

Use a code generator tool like re2c or ragel to compile your regular expression into C/C++ code (which can be optimized by the compiler).

Alternatively, Boost.Regex -- which was the basis for std::regex -- may be faster than your std::regex implementation.

Also, the bottleneck might be I/O rather than regular expressions; see the question "Why is reading lines from stdin much slower in C++ than Python?"

Upvotes: 2
