sven
sven

Reputation: 401

Stack overflow caused by backtracking in regular expression

I want to match a string using the following regular expression with C++11 using std::regex:

([^;]|'[^']*')*

What I am trying to achieve is the following: I want to match all characters except the semicolon, however, the semicolon should be ignored if enclosed by apostrophes (this denotes a string). When trying to apply the regular expression I get a stack overflow. I realize that the problem is caused by excessive backtracking. Unfortunately, I don't know how to get rid of the problem. How can the expression be rewritten such that it does not cause loads of backtracking?

Minimal code example with a dummy text:

#include <regex>
#include <string>
int main()
{
    std::string str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed ut suscipit enim. Praesent varius purus ac sem vulputate pulvinar. Mauris scelerisque arcu tortor, at bibendum dui rhoncus ut. Nunc dictum malesuada condimentum. Mauris ornare nunc eget efficitur tempor. Morbi ex nibh, consectetur vitae bibendum id, bibendum varius purus. Proin finibus quam vel ornare molestie. Mauris condimentum nisi efficitur, fringilla massa ut, commodo diam. Mauris lobortis laoreet magna sed commodo. Donec faucibus lectus placerat ex pulvinar interdum.";
    std::regex rgx("([^;]|'[^']*')*");
    std::regex_match(std::begin(str), std::end(str), rgx); // Stack overflow!
    return 0;
}

I am using Visual Studio 2012.

Upvotes: 2

Views: 678

Answers (1)

Christophe
Christophe

Reputation: 73446

If you switch the regex grammar to basic posix instead of deafult ECMAscript, it will no longer overflow:

std::regex rgx("([^;]|'[^']*')*", std::regex_constants::basic);

I tested with MSVC2013 and it works. Unfortunately, it will not match your expectations either, as demonstrates this small variant:

...
std::smatch cm;
std::regex_search(str, cm, rgx, std::regex_constants::match_any | std::regex_constants::match_continuous | std::regex_constants::match_not_null); // No overflow!
for (int i = 0; i < cm.size(); i++)
    std::cout << "Found:" << cm[i];  // but no result is printed out. 

If choosing std::regex_constants::extended option, the stack overflow is back again.

If you test your regex with your data on RegExr you will understand the problem:

enter image description here

The regex causes an infinity of potential matches. As soon as you reduce the potential set of matches for example by putting a ; after Lorem there is no stack overflow anymore.

This is indeed very unfortunate: no standard matching option (e.g. std::regex_constants::match_any | std::regex_constants::match_continuous | std::regex_constants::match_not_null) seems to take care of this frequent problem (e.g.: no "match largest").

Nevertheless, it remains related to the MSVC implementation of the standard library. This online GCC example runs fine with the same expression. So the boost or RE2 alternative could be worth to consider.

This code with boost 1.57 on MSVC2013, ran without any problem. As you see, it uses the same code/names as the standard, but replaces the std with boost:

#include <iostream>
#include <boost/regex.hpp>  // instead of <regex>
#include <string>
                            // use the boost alternative to std
using namespace boost::regex_constants;  
using boost::regex; using boost::smatch; using boost::regex_search;

int main()
{
    std::string str = "your very long string..." ;
    regex rgx("([^;]|'[^']*')*", extended | nosubs);
    smatch cm;
    //regex_match(str, cm, rgx, regex_constants:: match_continuous | match_not_null); // Stack overflow!
    regex_search(str, cm, rgx, match_continuous | match_not_null); 
    for (int i = 0; i < cm.size(); i++)
        std::cout << "Found:" << cm[i];
    return 0;
}

Upvotes: 2

Related Questions