Reputation: 789
I am experimenting with regular expressions in C++, and here is some code:
#include <iostream>
#include <regex>
#include <string>
int main (int argc, char *argv[]) {
std::regex pattern("[a-z]+", std::regex_constants::icase);
std::regex pattern2("excelsior", std::regex_constants::icase);
std::string text = "EXCELSIOR";
if (std::regex_match(text, pattern)) std::cout << "works" << std::endl;
else std::cout << "doesn't work" << std::endl;
if (std::regex_match(text, pattern2)) std::cout << "works" << std::endl;
else std::cout << "doesn't work" << std::endl;
return 0;
}
Now, from what I understand, both of those matches should output "works", but the first one outputs "doesn't work", while the second one outputs "works" as expected. Why?
Upvotes: 2
Views: 360
Reputation: 303537
Based on the rules described in [re.grammar], we have:
— During matching of a regular expression finite state machine against a sequence of characters, two characters
c and d are compared using the following rules:
1. if (flags() & regex_constants::icase) the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);
2. otherwise, if flags() & regex_constants::collate the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);
3. otherwise, the two characters are equal if c == d.
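As a rough illustration of the three rules quoted above, here is a minimal sketch written against std::regex_traits<char>. The function name chars_equal is hypothetical (it is not part of <regex>), and the sketch only mirrors the wording of the rules; it is not how any particular standard library implements matching.

```cpp
#include <regex>

// Hypothetical helper, not part of <regex>: compares two characters c and d
// following the three quoted rules for a given set of syntax flags.
bool chars_equal(char c, char d,
                 std::regex_constants::syntax_option_type flags) {
    std::regex_traits<char> traits;  // default traits, uses the global locale

    // Rule 1: with icase, compare the case-folded characters.
    if (flags & std::regex_constants::icase) {
        return traits.translate_nocase(c) == traits.translate_nocase(d);
    }
    // Rule 2: otherwise, with collate, compare the translated characters.
    if (flags & std::regex_constants::collate) {
        return traits.translate(c) == traits.translate(d);
    }
    // Rule 3: otherwise, compare the characters directly.
    return c == d;
}
```

Under these rules, chars_equal('E', 'e', std::regex_constants::icase) is true in the default "C" locale, which is why a literal pattern such as pattern2 matches regardless of case.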
This applies to your pattern2: we're matching a sequence of characters and we have flags() & icase, so we do a nocase comparison. Since each character in the sequence matches, it "works".
However, with pattern, we don't have a sequence of characters. So we instead use this rule:
— During matching of a regular expression finite state machine against a sequence of characters, comparison of a collating element range
c1-c2 against a character c is conducted as follows: if flags() & regex_constants::collate is false then the character c is matched if c1 <= c && c <= c2, otherwise c is matched in accordance with the following algorithm:
string_type str1 = string_type(1, flags() & icase ? traits_inst.translate_nocase(c1) : traits_inst.translate(c1));
string_type str2 = string_type(1, flags() & icase ? traits_inst.translate_nocase(c2) : traits_inst.translate(c2));
string_type str = string_type(1, flags() & icase ? traits_inst.translate_nocase(c) : traits_inst.translate(c));
return traits_inst.transform(str1.begin(), str1.end()) <= traits_inst.transform(str.begin(), str.end())
    && traits_inst.transform(str.begin(), str.end()) <= traits_inst.transform(str2.begin(), str2.end());
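To make the range rule concrete, here is a minimal sketch that follows the quoted wording using std::regex_traits<char>. The name in_range is hypothetical, and the sketch assumes the default "C" locale; it illustrates the rule as written and is not a claim about what a given standard library does internally, which may differ.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of <regex>: decides whether c falls in the
// range c1-c2 according to the quoted rule.
bool in_range(char c1, char c2, char c,
              std::regex_constants::syntax_option_type flags) {
    // Without collate the comparison is literal; icase is not consulted.
    if (!(flags & std::regex_constants::collate)) {
        return c1 <= c && c <= c2;
    }

    std::regex_traits<char> traits;  // default traits, uses the global locale
    const bool nocase =
        static_cast<bool>(flags & std::regex_constants::icase);

    // Build a one-character string, case-folded only when icase is set.
    auto fold = [&](char ch) {
        return std::string(1, nocase ? traits.translate_nocase(ch)
                                     : traits.translate(ch));
    };
    const std::string s1 = fold(c1), s2 = fold(c2), s = fold(c);

    // Compare the collation keys of the three strings.
    const std::string k1 = traits.transform(s1.begin(), s1.end());
    const std::string k2 = traits.transform(s2.begin(), s2.end());
    const std::string k  = traits.transform(s.begin(), s.end());
    return k1 <= k && k <= k2;
}
```

With only icase set, in_range('a', 'z', 'E', flags) takes the literal branch and returns false, since 'E' lies outside 'a'..'z' in ASCII. With icase | collate, 'E' is folded to 'e' before the comparison.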
Since you don't have collate set, the character is matched literally against the range a-z. There is no accounting for icase here, which is why it "doesn't work". If you provide collate, however:
std::regex pattern("[a-z]+",
std::regex_constants::icase | std::regex_constants::collate);
Then we use the algorithm described, which will do a no-case comparison, and the result will be "works". Both compilers are correct - though I find the expected behavior confusing in this case.
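If relying on the interaction between icase and collate feels fragile, one alternative is to spell out both cases in the character class so that no flag is needed. The sketch below assumes only ASCII letters matter; the function name is_all_letters is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if text is non-empty and consists only of
// ASCII letters. Both cases are listed explicitly, so the result does not
// depend on how icase is applied to ranges.
bool is_all_letters(const std::string& text) {
    static const std::regex pattern("[a-zA-Z]+");
    return std::regex_match(text, pattern);
}
```

Here is_all_letters("EXCELSIOR") returns true, because every character falls in the explicit A-Z range.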
Upvotes: 2
Reputation: 9273
The problem is caused by case sensitivity:
http://coliru.stacked-crooked.com/a/ac21a962ee9f28fc
The flag std::regex_constants::icase is not applied to the character range [a-z] by std::regex_match.
Edit:
Adding the flag std::regex_constants::collate solves the problem:
http://coliru.stacked-crooked.com/a/f57a2f2ff840c8be
Upvotes: 0
Reputation: 1
std::regex pattern("[a-z]+", std::regex_constants::icase);
still restricts matching to lowercase letters. I suppose the case-insensitive character matching mentioned in the reference does not apply to explicitly specified character sets. That is what I would expect: it makes sense to treat a set literally when it is spelled out explicitly.
Upvotes: 1