Reputation: 19032
I have a text file that is currently parsed with a regular expression, and it's working well. The file format is well defined: 2 numbers separated by any whitespace, followed by an optional comment.
Now we need to add an additional (but optional) 3rd number to this file, making the format 2 or 3 numbers separated by whitespace, followed by an optional comment.
I've got a regex object that at least matches all the necessary line formats, but I am not having any luck actually capturing the 3rd (optional) number, even when it is present.
Code:
#include <iostream>
#include <regex>
#include <vector>
#include <string>
#include <cassert>
using namespace std;
bool regex_check(const std::string& in)
{
    std::regex check{
        "[[:space:]]*?"                 // eat leading spaces
        "([[:digit:]]+)"                // capture 1st number
        "[[:space:]]*?"                 // eat second set of spaces
        "([[:digit:]]+)"                // capture 2nd number
        "[[:space:]]*?"                 // eat more spaces
        "([[:digit:]]+|[[:space:]]*?)"  // optionally, capture 3rd number
        "!*?"                           // Anything after '!' is a comment
        ".*?"                           // eat rest of line
    };
    std::smatch match;
    bool result = std::regex_match(in, match, check);
    for(auto m : match)
    {
        std::cout << " [" << m << "]\n";
    }
    return result;
}
int main()
{
    std::vector<std::string> to_check{
        " 12 3",
        " 1 2 ",
        " 12 3 !comment",
        " 1 2 !comment ",
        "\t1\t1",
        "\t 1\t 1\t !comment \t",
        " 16653 2 1",
        " 16654 2 1 ",
        " 16654 2 1 ! comment",
        "\t16654\t\t2\t 1\t ! comment\t\t",
    };
    for(auto s : to_check)
    {
        assert(regex_check(s));
    }
    return 0;
}
This gives the following output:
[ 12 3]
[12]
[3]
[]
[ 1 2 ]
[1]
[2]
[]
[ 12 3 !comment]
[12]
[3]
[]
[ 1 2 !comment ]
[1]
[2]
[]
[ 1 1]
[1]
[1]
[]
[ 1 1 !comment ]
[1]
[1]
[]
[ 16653 2 1]
[16653]
[2]
[]
[ 16654 2 1 ]
[16654]
[2]
[]
[ 16654 2 1 ! comment]
[16654]
[2]
[]
[ 16654 2 1 ! comment ]
[16654]
[2]
[]
As you can see, it's matching all of the expected input formats, but it is never able to actually capture the 3rd number, even when it is present.
I'm currently testing this with GCC 5.1.1, but the actual target compiler will be GCC 4.8.2, using boost::regex instead of std::regex.
Upvotes: 4
Views: 2739
Reputation: 51330
Let's step through the matching process on the following example.
16653 2 1
^
^ is the currently matched offset. At this point, we're here in the pattern:
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
^
(I've simplified [[:space:]] to \s and [[:digit:]] to \d for brevity.)
\s*? matches, and then (\d+) matches. We end up in the following state:
16653 2 1
     ^
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
         ^
Same thing: \s*? matches, and then (\d+) matches. The state is:
16653 2 1
       ^
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  ^
Now, things get trickier.
You have a \s*? here, a lazy quantifier. The engine first tries to match nothing and checks whether the rest of the pattern can match. So it tries the alternation.
The first alternative is \d+, but it fails, since there is no digit at this position (the next character is whitespace).
The second alternative is \s*?, and there are no other alternatives after that. It's lazy, so the engine tries to match the empty string first.
The next token is !*?, which also matches the empty string. It is followed by .*?, which will match everything up to the end of the string (it does so because you're using regex_match; with regex_search it would have matched the empty string).
At this point, the engine has reached the end of the pattern successfully, and you get a match without ever being forced to match \d+ against the string.
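The regex_match versus regex_search remark can be checked in isolation. The following is only a minimal sketch: the two helper names are made up for illustration, and the pattern is just the lazy tail !*?.*? of the original expression.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text consumed by the lazy tail when the WHOLE input
// must be matched (regex_match). Returns an empty string on failure.
std::string tail_with_regex_match(const std::string& in)
{
    static const std::regex tail(R"(!*?.*?)");
    std::smatch m;
    return std::regex_match(in, m, tail) ? m.str() : std::string();
}

// Hypothetical helper: text consumed by the same lazy tail when any
// substring may be matched (regex_search).
std::string tail_with_regex_search(const std::string& in)
{
    static const std::regex tail(R"(!*?.*?)");
    std::smatch m;
    return std::regex_search(in, m, tail) ? m.str() : std::string();
}
```

With an input such as " 1 ! comment", the regex_match variant should return the whole string, because the lazy .*? is forced to grow until the end of the input, while the regex_search variant should return an empty string, because matching nothing already satisfies the pattern.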
The thing is, this whole part of the pattern ends up being optional:
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  \__________________/
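To see the effect of that optional tail without reading the full program output, here is a small sketch that returns only the 3rd capture of the original pattern (written with the \s and \d shorthand used in this answer). The helper name and the "<no match>" marker are assumptions made for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: runs the question's original pattern and returns
// whatever the 3rd capture group ended up holding.
std::string third_capture_original(const std::string& in)
{
    static const std::regex check(
        R"(\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?)");
    std::smatch match;
    if (!std::regex_match(in, match, check)) {
        return "<no match>";  // marker value, only for this sketch
    }
    return match[3].str();
}
```

For " 16653 2 1" the line matches, but the returned capture is empty, which is consistent with the output posted in the question: the trailing " 1" is swallowed by .*? rather than by the 3rd group.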
So, what can you do? You can rewrite your pattern like so:
\s*?(\d+)\s+(\d+)(?:\s+(\d+))?\s*(?:!.*)?
Demo (with added anchors to mimic regex_match behavior)
This way, you're forcing the regex engine to consider \d and not get away with lazy-matching on the empty string. No need for lazy quantifiers since \s and \d are disjoint.
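The claim about lazy quantifiers being unnecessary can be illustrated with a tiny comparison. This is a sketch under assumptions: first_number is a hypothetical helper, and the trailing .* is added only so that regex_match accepts the rest of the line.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: captures the first number on a line, skipping the
// leading whitespace with either a lazy (\s*?) or a greedy (\s*) quantifier.
std::string first_number(const std::string& in, bool lazy)
{
    const std::regex re(lazy ? R"(\s*?(\d+).*)" : R"(\s*(\d+).*)");
    std::smatch m;
    return std::regex_match(in, m, re) ? m[1].str() : std::string();
}
```

Because a whitespace character can never also be a digit, both variants have to stop skipping at exactly the same position, so first_number("   42 rest", true) and first_number("   42 rest", false) should both return "42".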
!*?.*? was also suboptimal, since !*? is already covered by the following .*?. I rewrote it as (?:!.*)? to require a ! at the start of a comment; if it's not there, the match will fail.
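As a rough sketch of how the rewritten pattern could be used with std::regex: the ParsedLine struct and parse_line function below are hypothetical names introduced for illustration, not part of the original code. The same pattern should also be expressible with boost::regex, but that is not shown or tested here.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for one parsed line (not from the original post).
struct ParsedLine {
    bool ok = false;         // true when the line matched the format
    int first = 0;
    int second = 0;
    bool has_third = false;  // true only when the optional 3rd number exists
    int third = 0;
};

// Hypothetical helper built around the rewritten pattern from this answer.
ParsedLine parse_line(const std::string& in)
{
    static const std::regex check(
        R"(\s*?(\d+))"        // leading spaces, then 1st number
        R"(\s+(\d+))"         // mandatory spaces, then 2nd number
        R"((?:\s+(\d+))?)"    // optional: spaces followed by 3rd number
        R"(\s*(?:!.*)?)");    // trailing spaces, optional '!' comment
    ParsedLine out;
    std::smatch match;
    if (!std::regex_match(in, match, check)) {
        return out;  // out.ok stays false
    }
    out.ok = true;
    // Note: std::stoi throws std::out_of_range for values that do not fit in int.
    out.first = std::stoi(match[1].str());
    out.second = std::stoi(match[2].str());
    if (match[3].matched) {  // group 3 took part in the match
        out.has_third = true;
        out.third = std::stoi(match[3].str());
    }
    return out;
}
```

Checking match[3].matched, rather than testing for an empty string, distinguishes "no 3rd number" from a group that matched nothing. A line whose trailing text does not start with ! (for example " 12 3 comment") is rejected, in line with the stricter comment rule described above.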
Upvotes: 4