What is the best way to extract each character from a std::string and store it in its own std::string?
I run into a problem when the std::string contains non-ASCII characters, such as "á". If I do:
std::string test = "márcos";
std::string char1 = std::string(1, test.at(0));
std::string char2 = std::string(1, test.at(1));
std::string char3 = std::string(1, test.at(2));
std::string char4 = std::string(1, test.at(3));
std::cout << "Result: " << char1 << " -- " << char2 << " -- " << char3 << " -- " << char4 << std::endl;
Output: Result: m -- � -- � -- r
As you can see, the desired result would be "m -- á -- r -- c", but that is not what I get, because "á" is stored as two bytes and at() returns a single byte.
How can I solve this? Thanks :)
The number of bytes (between one and four) used to encode a codepoint in UTF-8 can be determined by looking at the high bits of the leading byte.
bytes  codepoints            byte 1    byte 2    byte 3    byte 4
1      U+0000  .. U+007F     0xxxxxxx
2      U+0080  .. U+07FF     110xxxxx  10xxxxxx
3      U+0800  .. U+FFFF     1110xxxx  10xxxxxx  10xxxxxx
4      U+10000 .. U+10FFFF   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
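The following breaks a UTF-8 encoded std::string into its individual codepoints. Note that a codepoint is not always a full user-perceived character: combining sequences such as "a" followed by U+0301 are still split into two pieces.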
#include <cstddef>
#include <iostream>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with lead byte c,
// or -1 if c cannot start a sequence (continuation byte or invalid byte).
int bytelen(unsigned char c)
{
    if((c & 0x80) == 0x00) return 1; // ascii char, including NUL ($)
    if((c & 0xE0) == 0xC0) return 2; // 2-byte codepoint (¢)
    if((c & 0xF0) == 0xE0) return 3; // 3-byte codepoint (€)
    if((c & 0xF8) == 0xF0) return 4; // 4-byte codepoint (𐍈)
    return -1; // error
}

int main()
{
    std::string test = "$¢€𐍈"; // assumes this source file is saved as UTF-8
    std::cout << "'" << test << "' length = " << test.length() << std::endl;
    for(std::size_t off = 0; off < test.length(); )
    {
        int len = bytelen(test[off]);
        if(len < 0) return 1; // not a valid lead byte
        std::string chr = test.substr(off, len);
        std::cout << "'" << chr << "'" << std::endl;
        off += len; // always advances by at least one byte
    }
    return 0;
}
Output:
'$¢€𐍈' length = 10
'$'
'¢'
'€'
'𐍈'