Reputation: 2613
Using std::regex, I would like the "." in a pattern such as ".c" to match any single UTF-8 character, followed by 'c'.
I've tried both Microsoft C++ and g++ and get the same result each time: the "." only matches a single byte.
Here's my test case:
#include <stdio.h>
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main(int argc, char** argv)
{
    // make a string with 3 UTF8 characters
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    string tobesearched((char*)p);
    // want to match the UTF8 character before c
    string pattern(".c");
    regex re(pattern);
    std::smatch match;
    bool r = std::regex_search(tobesearched, match, re);
    if (r)
    {
        // m.size() will be bytes, and we expect 3
        // expect 0xC2, 0x80, 'c'
        string m = match[0];
        cout << "match length " << m.size() << endl;
        // but we only get 2, we get the 0x80 and the 'c'.
        // so it's matching on single bytes and not utf8
        // code here is just to dump out the byte values.
        for (size_t i = 0; i < m.size(); ++i)
        {
            int c = m[i] & 0xff;
            printf("%02X ", c);
        }
        printf("\n");
    }
    else
        cout << "not matched\n";
    return 0;
}
I wanted the pattern ".c" to match 3 bytes of my tobesearched string: a 2-byte UTF-8 character followed by 'c'.
Upvotes: 1
Views: 2968
Reputation: 151
Some regex flavours support \X, which matches a single Unicode character (more precisely, a grapheme cluster) that may span several bytes depending on the encoding. Regex engines commonly receive the subject string in an encoding they are designed to work with, so you shouldn't have to worry about the actual encoding (whether it is US-ASCII, UTF-8, UTF-16 or UTF-32).
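With std::regex, one way to hand the engine an encoding it can work with is to decode the UTF-8 bytes into wide characters and search with std::wregex, so that "." consumes one decoded character. This is a minimal sketch, not a definitive implementation. It assumes well-formed UTF-8 input and a 32-bit wchar_t (as with g++ on Linux; where wchar_t is 16 bits, characters above U+FFFF would need different handling). utf8_to_wide and dot_c_match_length are hypothetical helper names:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: decode well-formed UTF-8 into one wchar_t per
// code point. Assumes valid input and a 32-bit wchar_t; no error handling.
std::wstring utf8_to_wide(const std::string& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); )
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        // Sequence length implied by the lead byte.
        const int n = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        unsigned long cp = (n == 1) ? lead : (lead & (0xFF >> (n + 1)));
        // Append six payload bits from each continuation byte.
        for (int k = 1; k < n && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(static_cast<wchar_t>(cp));
        i += n;
    }
    return out;
}

// Hypothetical helper: number of characters matched by ".c", or 0 if
// there is no match.
std::size_t dot_c_match_length(const std::string& utf8)
{
    const std::wstring wide = utf8_to_wide(utf8);
    const std::wregex re(L".c");
    std::wsmatch m;
    if (!std::regex_search(wide, m, re))
        return 0;
    return static_cast<std::size_t>(m.length(0));
}
```

For the test string in the question, the match should cover two characters (U+0080 and 'c'), which correspond to the three bytes 0xC2, 0x80, 'c' in the original UTF-8 string.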
Another option is \uFFFF, where FFFF refers to the unicode character at that index in the unicode charset. With that, you could create a ranged match inside a character class, i.e. [\u0000-\uFFFF]. Again, it depends on what the regex flavour supports. There is another variant of \u in \x{...}, which does the same thing, except that the unicode character index must be supplied inside curly braces and need not be padded, e.g. \x{65}.
Edit: This website is amazing for learning more about regex across various flavours https://www.regular-expressions.info
Edit 2: To match any Unicode-exclusive character, i.e. excluding characters in the ASCII table / 1 byte characters, you can try "[\x{80}-\x{10FFFF}]", i.e. any character with a value of 128 to 1,114,111: from the first character outside the ASCII range to U+10FFFF, the last Unicode code point. UTF-8 encodes these in up to 4 bytes (the original design allowed up to 6, but the current standard limits it to 4).
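To make the byte counts concrete, here is a small sketch of how many UTF-8 bytes each code point range needs, taking U+10FFFF as the last Unicode code point. utf8_encoded_length is a hypothetical helper name, not part of the standard library:

```cpp
// Hypothetical helper: number of bytes UTF-8 uses for a code point,
// or 0 if the value is a surrogate or lies beyond U+10FFFF.
int utf8_encoded_length(char32_t cp)
{
    if (cp < 0x80)
        return 1;                       // ASCII range
    if (cp < 0x800)
        return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       // surrogates are not encodable
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;                           // beyond the Unicode range
}
```

So every character matched by the "Unicode-exclusive" class above occupies between 2 and 4 bytes in UTF-8.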
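A loop through the individual bytes would be more efficient, though. For each byte, treated as a signed value:
1. If it is > -1 (lead bit 0), it is a 1 byte char representation. Skip to the next byte and start again.
2. Else if it is > -17 (lead bits 11110), n=4.
3. Else if it is > -33 (lead bits 1110), n=3.
4. Else if it is > -65 (lead bits 110), n=2.
5. Optionally, check that the n-1 bytes after the lead byte each start with 10, i.e. each has a signed value < -64; if one does not, it is invalid UTF-8 encoding.
6. If the byte after those n bytes is 'c', i.e. == 99, you can say it matched - return true.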
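The steps above can be sketched in C++ roughly as follows. This is an illustration only: non_ascii_char_before is a hypothetical name, the signed-value comparisons are written as unsigned byte tests so the code does not depend on whether plain char is signed, and overlong or out-of-range sequences are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` contains a multi-byte (non-ASCII)
// UTF-8 character immediately followed by the byte `next`.
bool non_ascii_char_before(const std::string& s, char next)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t n = 0;
        if (b < 0x80) { ++i; continue; }   // 0xxxxxxx: 1-byte char, skip it
        else if (b >= 0xF0) n = 4;         // 11110xxx
        else if (b >= 0xE0) n = 3;         // 1110xxxx
        else if (b >= 0xC0) n = 2;         // 110xxxxx
        else { ++i; continue; }            // stray continuation byte, skip it

        // The n-1 bytes after the lead byte must each look like 10xxxxxx.
        bool valid = i + n <= s.size();
        for (std::size_t k = 1; valid && k < n; ++k)
            valid = (static_cast<unsigned char>(s[i + k]) & 0xC0) == 0x80;
        if (!valid) { ++i; continue; }     // invalid sequence: resume at next byte

        // A complete multi-byte character ends at i + n - 1.
        if (i + n < s.size() && s[i + n] == next)
            return true;
        i += n;
    }
    return false;
}
```

Calling non_ascii_char_before(tobesearched, 'c') on the string from the question should return true, since the 2-byte character 0xC2 0x80 is directly followed by 'c'.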
Upvotes: 2