Reputation: 169
I'd like to think this isn't actually a bug in the standard library, but I'm running out of places to look.
The statement std::regex(expression), where expression is a std::string, causes a fatal memory access error.
expression is declared by the statement:
std::string expression = std::string("^(") +
std::string("[\x09\x0A\x0D\x20-\x7E]|") + // ASCII
std::string("[\xC2-\xDF][\x80-\xBF]|") + // non-overlong 2-byte
std::string("\xE0[\xA0-\xBF][\x80-\xBF]|") + // excluding overlong
std::string("[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|") + // straight 3-byte
std::string("\xED[\x80-\x9F][\x80-\xBF]|") + // excluding surrogates
std::string("\xF0[\x90-\xBF][\x80-\xBF]{2}|") + // planes 1-3
std::string("[\xF1-\xF3][\x80-\xBF]{3}|") + // planes 4-15
std::string("\xF4[\x80-\x8F][\x80-\xBF]{2}") + // plane 16
")*$";
This regex was taken from http://www.w3.org/International/questions/qa-forms-utf-8 to test whether a byte sequence is valid UTF-8.
Is this actually a bug in the library, or am I missing something really tiny?
Compiled as C++ with VS2015, if that happens to make a difference.
EDIT:
I forgot to mention that one specific line breaks the code:
std::string("[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|") + // straight 3-byte
It is the only line that breaks. Comment it out and everything works fine. This line on its own causes the memory access error.
Upvotes: 0
Views: 146
If you write regex escapes in an ordinary (non-raw) string literal,
you have to escape the backslashes themselves.
For example, the new string:
std::string expression = std::string("^(") +
std::string("[\\x09\\x0A\\x0D\\x20-\\x7E]|") + // ASCII
std::string("[\\xC2-\\xDF][\\x80-\\xBF]|") + // non-overlong 2-byte
std::string("\\xE0[\\xA0-\\xBF][\\x80-\\xBF]|") + // excluding overlong
std::string("[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}|") + // straight 3-byte
std::string("\\xED[\\x80-\\x9F][\\x80-\\xBF]|") + // excluding surrogates
std::string("\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}|") + // planes 1-3
std::string("[\\xF1-\\xF3][\\x80-\\xBF]{3}|") + // planes 4-15
std::string("\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}") + // plane 16
")*$";
When you don't escape them, the compiler interprets each \xNN sequence itself
as a hex escape and embeds the raw byte in the string, so the regex engine never sees the escape.
While the regex engine probably gets the right character either way,
it is better to pass the hex escape to the engine as text, so you can see which character
breaks it (if one does).
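The "raw syntax" alternative could look like the sketch below. A raw string literal (C++11 and later) leaves backslashes alone, so each \xNN escape reaches the regex engine as text without doubling. The helper names utf8_pattern and is_valid_utf8 are made up for illustration, and this should be treated as a sketch rather than a fix verified on VS2015.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: builds the W3C UTF-8 pattern once from raw string
// literals. Adjacent literals are concatenated by the compiler, and the
// \xNN escapes are left for the regex engine to interpret.
inline const std::regex& utf8_pattern() {
    static const std::regex pattern(
        R"(^()"
        R"([\x09\x0A\x0D\x20-\x7E]|)"              // ASCII
        R"([\xC2-\xDF][\x80-\xBF]|)"               // non-overlong 2-byte
        R"(\xE0[\xA0-\xBF][\x80-\xBF]|)"           // excluding overlong
        R"([\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|)"    // straight 3-byte
        R"(\xED[\x80-\x9F][\x80-\xBF]|)"           // excluding surrogates
        R"(\xF0[\x90-\xBF][\x80-\xBF]{2}|)"        // planes 1-3
        R"([\xF1-\xF3][\x80-\xBF]{3}|)"            // planes 4-15
        R"(\xF4[\x80-\x8F][\x80-\xBF]{2})"         // plane 16
        R"()*$)");
    return pattern;
}

// Hypothetical helper: true if every byte of `bytes` is covered by the pattern.
inline bool is_valid_utf8(const std::string& bytes) {
    return std::regex_match(bytes, utf8_pattern());
}
```

With this, is_valid_utf8("caf\xC3\xA9") should return true, while is_valid_utf8("\xC3\x28") should return false. The test inputs deliberately use ordinary hex escapes, since there the raw bytes are what you want.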
Upvotes: 1