Store each character from a std::string into a std::string

Question

I would like to know what method you would use to get each character from a std::string and store it in another std::string.

I run into a problem when the std::string contains non-ASCII characters, such as "á". If I do:

std::string test = "márcos";

std::string char1 = std::string(1, test.at(0));
std::string char2 = std::string(1, test.at(1));
std::string char3 = std::string(1, test.at(2));
std::string char4 = std::string(1, test.at(3));

std::cout << "Result: " << char1 << " -- " << char2 << " -- " << char3  << " -- " << char4 << std::endl;

Output: Result: m -- � -- � -- r

As you can see, the desired result would be "m -- á -- r -- c", but this is not the case because the special character is stored as two bytes (two char elements) rather than one.

How can we solve this? Thanks :)

dxiv · Accepted Answer

The number of bytes (between one and four) used to encode a codepoint in UTF-8 can be determined by looking at the high bits of the leading byte.

bytes    codepoints             byte 1    byte 2    byte 3    byte 4
  1      U+0000  .. U+007F      0xxxxxxx        
  2      U+0080  .. U+07FF      110xxxxx  10xxxxxx        
  3      U+0800  .. U+FFFF      1110xxxx  10xxxxxx  10xxxxxx        
  4      U+10000 .. U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

The following breaks a UTF-8 encoded std::string into its individual codepoints, each stored in its own std::string.

#include <iostream>
#include <string>

int bytelen(char c)
{
    if(!c)                  return 0;   // empty string
    if(!(c & 0x80))         return 1;   // ascii char       ($)
    if((c & 0xE0) == 0xC0)  return 2;   // 2-byte codepoint (¢)
    if((c & 0xF0) == 0xE0)  return 3;   // 3-byte codepoint (€)
    if((c & 0xF8) == 0xF0)  return 4;   // 4-byte codepoint (𐍈)

    return -1;                          // error
}

int main()
{
    std::string test = "$¢€𐍈";
    std::cout << "'" << test << "' length = " << test.length() << std::endl;

    for(int off = 0, len; off < test.length(); off += len)
    {
        len = bytelen(test[off]);
        if(len <= 0) return 1;  // invalid lead byte, or embedded NUL (a length of 0 would loop forever)

        std::string chr = test.substr(off, len);
        std::cout << "'" << chr << "'" << std::endl;
    }

    return 0;
}

Output:

'$¢€𐍈' length = 10
'$'
'¢'
'€'
'𐍈'
