James

Reputation: 169

std::regex fatal error

I'd like to think this isn't actually a bug in the standard library, but I'm running out of places to look.

The statement std::regex(expression) where expression is a std::string causes a memory access fatal error.

expression is declared by the statement:

std::string expression = std::string("^(") +
    std::string("[\x09\x0A\x0D\x20-\x7E]|") + // ASCII
    std::string("[\xC2-\xDF][\x80-\xBF]|") + // non-overlong 2-byte
    std::string("\xE0[\xA0-\xBF][\x80-\xBF]|") + // excluding overlong
    std::string("[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|") + // straight 3-byte
    std::string("\xED[\x80-\x9F][\x80-\xBF]|") + // excluding surrogates
    std::string("\xF0[\x90-\xBF][\x80-\xBF]{2}|") + // planes 1-3
    std::string("[\xF1-\xF3][\x80-\xBF]{3}|") + // planes 4-15
    std::string("\xF4[\x80-\x8F][\x80-\xBF]{2}") + // plane 16
    ")*$";

This regex was taken from http://www.w3.org/International/questions/qa-forms-utf-8 to test whether a byte sequence is valid UTF-8.

Is this actually a bug in the library, or am I missing something really tiny?

Compiled with VS2015 C++, if that happens to make a difference.

EDIT: I forgot to mention that one specific line breaks the code: std::string("[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|") + // straight 3-byte is the only line that fails. Comment it out and everything works fine. This line on its own is enough to cause the memory access error.

Upvotes: 0

Views: 146

Answers (1)

user557597

If you use escapes in an ordinary (non-raw) string literal,
you have to escape the escapes, i.e. double each backslash.

For example, the string becomes:

std::string expression = std::string("^(") +
    std::string("[\\x09\\x0A\\x0D\\x20-\\x7E]|") + // ASCII
    std::string("[\\xC2-\\xDF][\\x80-\\xBF]|") + // non-overlong 2-byte
    std::string("\\xE0[\\xA0-\\xBF][\\x80-\\xBF]|") + // excluding overlong
    std::string("[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}|") + // straight 3-byte
    std::string("\\xED[\\x80-\\x9F][\\x80-\\xBF]|") + // excluding surrogates
    std::string("\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}|") + // planes 1-3
    std::string("[\\xF1-\\xF3][\\x80-\\xBF]{3}|") + // planes 4-15
    std::string("\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}") + // plane 16
    ")*$";

When you don't escape them, the compiler itself interprets each \xNN
as a hex escape sequence and replaces it with the corresponding raw byte,
so the regex engine never sees the \x notation.

And while the regex engine probably gets the right character either way,
it is better to pass the hex escape through to the engine: the pattern stays
readable, and you can see which character breaks it (if one does).
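To make the difference concrete, here is a minimal sketch (not from the original answer) comparing the three ways of writing such a literal. It uses an ASCII-only range so the result does not depend on how a given library treats bytes above 0x7F. The helper names compiler_decoded, escaped, raw and matches_upper are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Ordinary literal: the compiler replaces \x41 and \x5A with the single
// bytes 'A' and 'Z', so the regex engine receives "[A-Z]+".
inline std::string compiler_decoded() { return "[\x41-\x5A]+"; }

// Doubled backslashes: the text \x41 and \x5A reaches the regex engine,
// which decodes the hex escapes itself.
inline std::string escaped() { return "[\\x41-\\x5A]+"; }

// Raw string literal (C++11): same text as escaped(), without doubling.
inline std::string raw() { return R"([\x41-\x5A]+)"; }

// Hypothetical helper: true if the whole of s matches the pattern.
inline bool matches_upper(const std::string& pattern, const std::string& s) {
    return std::regex_match(s, std::regex(pattern));
}
```

With the raw form, each line of the expression in the question could be written as it appears on the W3C page, e.g. R"([\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|)". Whether that alone avoids the crash on VS2015 is not something this sketch verifies.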

Upvotes: 1
