suyash

Reputation: 789

matching text ranges with C++11 regexes

I am trying out regexes in C++, and here is some code:

#include <iostream>
#include <regex>
#include <string>

int main (int argc, char *argv[]) {
  std::regex pattern("[a-z]+", std::regex_constants::icase);
  std::regex pattern2("excelsior", std::regex_constants::icase);
  std::string text = "EXCELSIOR";

  if (std::regex_match(text, pattern)) std::cout << "works" << std::endl;
  else std::cout << "doesn't work" << std::endl;

  if (std::regex_match(text, pattern2)) std::cout << "works" << std::endl;
  else std::cout << "doesn't work" << std::endl;

  return 0;
}

Now, from what I understand, both matches should output "works", but the first one outputs "doesn't work", while the second one outputs "works" as expected. Why?

Upvotes: 2

Views: 360

Answers (3)

Barry

Reputation: 303537

Based on the rules described in [re.grammar], we have:

— During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using the following rules:
1. if (flags() & regex_constants::icase) the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);
2. otherwise, if flags() & regex_constants::collate the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);
3. otherwise, the two characters are equal if c == d.

This applies to your pattern2: we are matching a sequence of characters and flags() & icase is set, so the comparison is case-insensitive. Since each character in the sequence matches, it "works".
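The three comparison rules quoted above can be sketched with std::regex_traits<char>. This is only an illustration of the quoted wording, not how any particular standard library implements it. has_flag and chars_equal are hypothetical helper names, and the default "C" locale is assumed.

```cpp
#include <regex>

namespace rc = std::regex_constants;

// Hypothetical helper: true if `flag` is set in `flags`.
bool has_flag(rc::syntax_option_type flags, rc::syntax_option_type flag) {
    return (flags & flag) != rc::syntax_option_type();
}

// Hypothetical helper mirroring the three quoted rules for comparing
// a pattern character c with an input character d.
bool chars_equal(char c, char d, rc::syntax_option_type flags) {
    std::regex_traits<char> traits;
    if (has_flag(flags, rc::icase))    // rule 1: case-insensitive comparison
        return traits.translate_nocase(c) == traits.translate_nocase(d);
    if (has_flag(flags, rc::collate))  // rule 2: locale-aware translation
        return traits.translate(c) == traits.translate(d);
    return c == d;                     // rule 3: plain comparison
}
```

With icase set, chars_equal('e', 'E', std::regex_constants::icase) takes the first branch, which is the path pattern2 follows for each of its characters.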

However, pattern contains a collating element range (a-z) rather than a literal sequence of characters, so this rule applies instead:

— During matching of a regular expression finite state machine against a sequence of characters, comparison of a collating element range c1-c2 against a character c is conducted as follows: if flags() & regex_constants::collate is false then the character c is matched if c1 <= c && c <= c2, otherwise c is matched in accordance with the following algorithm:

string_type str1 = string_type(1,
    flags() & icase ?
        traits_inst.translate_nocase(c1) : traits_inst.translate(c1));
string_type str2 = string_type(1,
    flags() & icase ?
        traits_inst.translate_nocase(c2) : traits_inst.translate(c2));
string_type str = string_type(1,
    flags() & icase ?
        traits_inst.translate_nocase(c) : traits_inst.translate(c));
return traits_inst.transform(str1.begin(), str1.end())
        <= traits_inst.transform(str.begin(), str.end())
    && traits_inst.transform(str.begin(), str.end())
        <= traits_inst.transform(str2.begin(), str2.end());

Since you don't have collate set, each character is compared literally against the range a-z (c1 <= c && c <= c2). icase is not taken into account here, which is why it "doesn't work". If you provide collate, however:

std::regex pattern("[a-z]+", 
                   std::regex_constants::icase | std::regex_constants::collate);

Then the algorithm described above is used, which does a case-insensitive comparison, and the result is "works". Both compilers are correct, though I find the expected behavior confusing in this case.
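To make the two branches of the range rule concrete, here is a rough sketch that follows the quoted wording using std::regex_traits<char>. It is not taken from any standard library implementation. range_matches is a hypothetical helper name, and an ASCII execution character set with the default "C" locale is assumed.

```cpp
#include <regex>
#include <string>

namespace rc = std::regex_constants;

// Hypothetical helper following the quoted range rule: does character c
// fall inside the range c1-c2 under the given flags?
bool range_matches(char c1, char c2, char c, rc::syntax_option_type flags) {
    const auto none = rc::syntax_option_type();
    if ((flags & rc::collate) == none)
        return c1 <= c && c <= c2;  // literal comparison; icase is not consulted

    std::regex_traits<char> traits;
    const bool nocase = (flags & rc::icase) != none;
    // Translate one character (case-folded if icase is set) into a string.
    auto fold = [&](char ch) {
        return std::string(1, nocase ? traits.translate_nocase(ch)
                                     : traits.translate(ch));
    };
    const std::string str1 = fold(c1), str2 = fold(c2), str = fold(c);
    // Compare the collation keys of the translated characters.
    const std::string key = traits.transform(str.begin(), str.end());
    return traits.transform(str1.begin(), str1.end()) <= key
        && key <= traits.transform(str2.begin(), str2.end());
}
```

Under those assumptions, 'E' lies outside a-z when only icase is set, because 'E' sorts before 'a' in ASCII. With icase | collate, 'E' is first folded to 'e' and falls inside the range.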

Upvotes: 2

mMontu

Reputation: 9273

The problem is caused by case sensitivity:

http://coliru.stacked-crooked.com/a/ac21a962ee9f28fc

The flag std::regex_constants::icase appears to be ignored when std::regex_match evaluates the character range [a-z].


Edit:

Adding the flag std::regex_constants::collate solves the problem:

http://coliru.stacked-crooked.com/a/f57a2f2ff840c8be

Upvotes: 0

πάντα ῥεῖ

Reputation: 1

std::regex pattern("[a-z]+", std::regex_constants::icase);

still restricts matching to lowercase letters. I suppose the case-insensitive character matching mentioned in the reference does not apply to explicitly specified character sets. That is what I would expect, and it makes sense to take such sets as written when they are specified explicitly.
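If explicit character sets are indeed taken as written, one way to sidestep the question is to spell out both cases in the set, so the result does not depend on how icase interacts with ranges. A minimal sketch, assuming ASCII-only input (is_all_letters is a hypothetical helper name):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if text consists only of ASCII letters.
// Both cases are listed explicitly, so no icase handling is needed.
bool is_all_letters(const std::string& text) {
    static const std::regex letters("[a-zA-Z]+");
    return std::regex_match(text, letters);
}
```

Calling is_all_letters("EXCELSIOR") then matches without relying on any flags.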

Upvotes: 1
