Reputation: 1068
I have code that parses files using std::regex. The problem is that the same regex takes significantly different time to execute when compiled on Linux with GCC and on Windows with LLVM 2014 (LLVM's regex is about 1000 times slower).
The files contain blocks in various standard formats, such as this:
Material mat1 {
1.0; 1.0; 1.0; 1.0;;
0.21;
1.0; 1.0; 1.0;;
0.0; 0.0; 0.0;;
}
I have the following regex for matching this block:
regex rgMtr(
"\\r?\\n\\s+Material (\\S+) \\{\
\\r?\\n\\s+(\\d\\.\\d+); (\\d\\.\\d+); (\\d\\.\\d+); (\\d\\.\\d+);;\\s*\
\\r?\\n\\s+(\\d\\.\\d+);\\s*\
\\r?\\n\\s+(\\d\\.\\d+); (\\d\\.\\d+); (\\d\\.\\d+);;\\s*\
\\r?\\n\\s+(\\d\\.\\d+); (\\d\\.\\d+); (\\d\\.\\d+);;\\s*\
\\r?\\n\\s+\\}"
);
The matching is done as follows:
std::smatch sm;
string::const_iterator itst = text.begin();
string::const_iterator iten = text.end();
while( std::regex_search( itst, iten, sm, rgMtr ) ) {
// ... match processing ...
}
A 500 KB file contains two such blocks at the very beginning. The fast regex (GCC) processes it in about 2 seconds. The slow regex (LLVM) finds the two matches almost instantly, but processing the rest of the file takes several minutes (maybe 15 or 20).
I tried various modifiers in the regex constructor and in the regex_search function call, but none gives noticeable results.
Is there some optimization or option that could be used to fix this problem?
UPDATE: on Michael Burr's recommendation, adding more details:
Fast regex: compiled on ArchLinux, GCC 7.2.0.
Slow regex: compiled on Windows 7, MSVC 2015 with LLVM 2014.
A minimal example can be the following:
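#include <fstream>
#include <iterator>
#include <regex>
#include <string>

using std::regex;
using std::string;

int main()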
{
regex rgMtr(
"\\r?\\n\\s+Material (\\S+) \\{\
\\r?\\n\\s+(\\d\\.\\d+); (\\d\\.\\d+); (\\d\\.\\d+); (\\d\\.\\d+);;\\s*\
\\r?\\n\\s+(\\d\\.\\d+);\\s*\
\\r?\\n\\s+(\\d\\.\\d+); (\\d\\.\\d+); (\\d\\.\\d+);;\\s*\
\\r?\\n\\s+(\\d\\.\\d+); (\\d\\.\\d+); (\\d\\.\\d+);;\\s*\
\\r?\\n\\s+\\}"
);
std::ifstream fs( "file name" );
string sf( (std::istreambuf_iterator<char>( fs )), std::istreambuf_iterator<char>());
std::smatch sm;
string::const_iterator itst = sf.begin();
string::const_iterator iten = sf.end();
while( std::regex_search( itst, iten, sm, rgMtr ) ) {
itst = itst + sm.position() + sm.length();
}
return 0;
}
The text file can be any 500 KB text file with the snippet cited above pasted at the beginning.
UPDATE 2: Constructed according to MSalter's guidelines, this regex works OK:
regex rgMtr(
R"regex(\s+Material (\S+) \{
\s+(\d\.\d+); (\d\.\d+); (\d\.\d+); (\d\.\d+);;
\s+(\d\.\d+);
\s+(\d\.\d+); (\d\.\d+); (\d\.\d+);;
\s+(\d\.\d+); (\d\.\d+); (\d\.\d+);;
\s+\})regex"
);
Upvotes: 0
Views: 357
Reputation: 180020
The regex looks like it can cause quite a bit of unnecessary backtracking.
In particular, it starts with an optional \r. This is probably better solved by just using std::fstream in text mode (not binary). It will replace platform-specific newline encodings (such as CR-LF) with \n when reading.
The result is that the regex can now match \n as its definite first character.
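A minimal sketch of reading the whole file through a text-mode stream. The helper name read_text_file is hypothetical, not something from the question. Note that a std::ifstream opened without std::ios::binary is already in text mode; the CR-LF translation happens on Windows, while on Linux the bytes pass through unchanged.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: slurp a file through a text-mode stream.
// Text mode is the default when std::ios::binary is not passed, so on
// Windows a CR-LF pair arrives here as a single '\n'.
std::string read_text_file(const std::string& path) {
    std::ifstream in(path);  // no std::ios::binary: text mode
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With the input normalized this way, the pattern no longer needs the optional \r in front of each \n.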
Furthermore, I think your \r?\n\s+ is missing the fact that both \r and \n are part of the \s whitespace class. I think you meant just horizontal whitespace, [ \t]+. This is especially a problem for the newlines further in your regex, as they are sandwiched between ;\s* from the previous line and \s+ following.
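As a rough sketch of that suggestion, the pattern could be rewritten with [ \t] in place of \s, so that no quantified piece can consume the \n between lines. The function names material_regex and first_material_name are made up for illustration, and the sketch assumes the input has already been normalized to \n line endings and that each line of the block is indented by spaces or tabs.

```cpp
#include <regex>
#include <string>

// Hypothetical rewrite of the question's pattern: [ \t] replaces \s, so
// the literal \n is the only piece able to match a line break.
const std::regex& material_regex() {
    static const std::regex rg(
        R"(\n[ \t]+Material (\S+) \{)"
        R"(\n[ \t]+(\d\.\d+); (\d\.\d+); (\d\.\d+); (\d\.\d+);;[ \t]*)"
        R"(\n[ \t]+(\d\.\d+);[ \t]*)"
        R"(\n[ \t]+(\d\.\d+); (\d\.\d+); (\d\.\d+);;[ \t]*)"
        R"(\n[ \t]+(\d\.\d+); (\d\.\d+); (\d\.\d+);;[ \t]*)"
        R"(\n[ \t]+\})");
    return rg;
}

// Returns the name captured by the first matching block, or an empty
// string when the text contains no such block.
std::string first_material_name(const std::string& text) {
    std::smatch m;
    if (std::regex_search(text, m, material_regex())) {
        return m[1].str();
    }
    return std::string();
}
```

Whether this is fast enough on a given standard library still has to be measured; the sketch only removes the overlap between the newline and the surrounding whitespace classes.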
This is a specific instance of a generic problem with regexes; having two adjacent subexpressions of variable length and overlapping classes is a bad idea. You may get lucky with optimized implementations of .*.
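To make the overlap concrete, here is a small illustrative sketch (not a model of any particular regex engine) that counts how many ways \s*\n\s+ can cover one run of whitespace, compared with [ \t]*\n[ \t]+. Both function names are hypothetical, and both assume the run contains only whitespace characters. A backtracking matcher may end up trying each of those alternatives before it gives up on a position.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Ways \s*\n\s+ can cover the whole run: any '\n' that still has at least
// one character after it can play the role of the literal \n.
std::size_t ways_with_s_class(const std::string& run) {
    std::size_t ways = 0;
    for (std::size_t i = 0; i + 1 < run.size(); ++i) {
        if (run[i] == '\n') ++ways;
    }
    return ways;
}

// Ways [ \t]*\n[ \t]+ can cover the whole run: the run must contain
// exactly one '\n', followed by at least one character, so there is at
// most one way.
std::size_t ways_with_horizontal_class(const std::string& run) {
    const std::size_t newlines =
        static_cast<std::size_t>(std::count(run.begin(), run.end(), '\n'));
    const std::size_t pos = run.find('\n');
    return (newlines == 1 && pos + 1 < run.size()) ? 1 : 0;
}
```

For a run of four newlines followed by a space, the first function reports four candidate splits, while the second reports none, because the horizontal class cannot absorb the extra newlines at all.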
Upvotes: 2