Reputation: 53
I'm trying to tokenize input consisting of UTF-8 characters. While trying to learn UTF-8, I get output that I cannot understand. When I input the character π (pi), I get three different numbers: 207 128 10. How can I use them to determine which category the character belongs to?
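// Read all of standard input into a string.
ostringstream oss;
oss << cin.rdbuf();
string input = oss.str();
// Print each byte (UTF-8 code unit) as an unsigned decimal value.
for(size_t i=0; i<input.size(); i++)
{
    unsigned char code_unit = input[i];
    cout << (int)code_unit << endl;
}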
Thanks in advance.
Upvotes: 0
Views: 104
Reputation: 110658
Characters encoded in UTF-8 may take up more than a single byte (and often do). A single code point is encoded in 1 to 4 bytes under RFC 3629 (the original UTF-8 design allowed up to 6). In the case of π, the UTF-8 encoding, in binary, is:
11001111 10000000
That is, it is two bytes, and you are reading these bytes out individually. The first byte has decimal value 207 and the second has decimal value 128 (when each is interpreted as an unsigned integer). The third byte that you're reading has decimal value 10; it is the Line Feed character, which you supply when you hit Enter.
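To see how those two bytes relate to the character, here is a minimal sketch that combines them into a code point. The helper name `decode_two_bytes` is made up for illustration, and it assumes the caller already knows it has a valid two-byte sequence of the form 110xxxxx 10yyyyyy.

```cpp
#include <cstdint>

// Hypothetical helper: combine a two-byte UTF-8 sequence
// (110xxxxx 10yyyyyy) into a single code point.
// Assumes both bytes are already known to be valid.
std::uint32_t decode_two_bytes(unsigned char lead, unsigned char cont)
{
    // Keep the low 5 bits of the lead byte and the low 6 bits
    // of the continuation byte, then join them.
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}
```

For the bytes 207 and 128, this gives 960, which is U+03C0, the code point of π.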
If you're going to do any processing of these UTF-8 characters, you're going to need to interpret what the bytes mean. What exactly you'll need to do depends on how you're categorising the characters.
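As a rough sketch of what "interpreting the bytes" could look like, the code below decodes a whole string into code points and then checks one example category. The names `decode_utf8` and `is_greek` are hypothetical, and treating "category" as "Unicode block" is an assumption about what the question needs. Malformed or truncated sequences become U+FFFD; overlong forms and surrogates are not rejected, so this is not a full validator.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 string into code points.
// Invalid lead bytes and broken sequences yield U+FFFD.
std::vector<std::uint32_t> decode_utf8(const std::string& input)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char lead = static_cast<unsigned char>(input[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray or invalid byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= input.size() ||
                (static_cast<unsigned char>(input[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(input[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}

// Example category test (assumption: category = Unicode block).
// The Greek and Coptic block spans U+0370 to U+03FF.
bool is_greek(std::uint32_t cp)
{
    return cp >= 0x0370 && cp <= 0x03FF;
}
```

With the input from the question, `decode_utf8` would return the two code points U+03C0 and U+000A, and `is_greek` is true for the first. For real categorisation (letters, digits, punctuation and so on), a Unicode library such as ICU is likely a better fit than hand-written range checks.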
Upvotes: 3