ggrr

Reputation: 7867

How to print each character of a string that mixes ASCII characters with Unicode?

For example, I want to create a typewriter effect, so I need to print strings like this:

#include <cstdio>
#include <string>
int main(){
    std::string st1="ab》cd《ef";
    for(int i=0;i<st1.size();i++){
        std::string st2=st1.substr(0,i).c_str();
        printf("%s\n",st2.c_str());
    }
    return 0;
}

But the output is:

a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e

and not:

a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e

How can I tell that the upcoming character is Unicode (not plain ASCII)?

A similar attempt, printing one character at a time, has the same problem:

#include <cstdio>
#include <string>
int main(){
    std::string st1="ab》cd《ef";
    for(int i=0;i<st1.size();i++){
        std::string st2=st1.substr(i,1).c_str();
        printf("%s\n",st2.c_str());
    }
    return 0;
}

the output is:

a
b
?
?
?
c
d
?
?
?
e
f

not:

a
b
》
c
d
《
e
f

Upvotes: 0

Views: 653

Answers (2)

Galik

Reputation: 48635

I think the problem is the encoding. Your string is most likely UTF-8, which is a variable-width encoding. That means you cannot iterate one char at a time, because some characters occupy more than one char.
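
To make the byte/character mismatch concrete, here is a minimal sketch (not part of the original answer) that counts the code points in a UTF-8 string by skipping continuation bytes, i.e. bytes of the form 10xxxxxx. The name count_code_points is made up for illustration, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte
// that does NOT match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string from the question, size() reports 12 while count_code_points reports 8, because 》 and 《 each take three bytes in UTF-8.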

In Unicode, the only encoding that lets you reliably iterate one fixed-size unit per code point is UTF-32.

So what you can do is use a Unicode library like ICU to convert between UTF-8 and UTF-32.

If you have C++11, there are some tools to help you here, mainly std::u32string, which can hold UTF-32 encoded strings:

#include <string>
#include <iostream>

#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>

// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
    UErrorCode status = U_ZERO_ERROR;
    char target[1024];
    int32_t len = ucnv_convert(
        "UTF-8", "UTF-32"
        , target, sizeof(target)
        , (const char*)s.data(), s.size() * sizeof(char32_t)
        , &status);
    return std::string(target, len);
}

// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    char32_t target[256];
    int32_t len = ucnv_convert(
        "UTF-32", "UTF-8"
        , (char*)target, sizeof(target)
        , utf8.data(), utf8.size()
        , &status);
    return std::u32string(target, (len / sizeof(char32_t)));
}

int main()
{
    // UTF-8 input (needs UTF-8 editor)
    std::string utf8 = "ab》cd《ef"; // UTF-8

    // convert to UTF-32
    std::u32string utf32 = to_utf32(utf8);

    // Now it is safe to use string indexing.
    // utf32[0] is the BOM, so start at i = 1 (the first real character).
    for(std::size_t i = 1; i < utf32.size(); ++i)
    {
        // convert back to UTF-8 for output
        // NOTE: the length is i + 1 so the prefix includes the BOM
        std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
    }
}

Output:

a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef

NOTE:

The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
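
If you want to work on the code points yourself (counting, indexing, comparing), the BOM can simply be dropped first. The following is a minimal sketch; strip_bom is a hypothetical helper name, not an ICU function. Note that the to_utf8 function above hands the BOM back to ICU, so a stripped string should not be assumed to round-trip through it unchanged.

```cpp
#include <string>

// Returns a copy of `text` without a leading U+FEFF (byte order mark).
// Strings that do not start with a BOM are returned unchanged.
std::u32string strip_bom(const std::u32string& text)
{
    const char32_t bom = 0xFEFF;
    if (!text.empty() && text.front() == bom)
        return text.substr(1);
    return text;
}
```

With the BOM removed, index 0 is the first real character and size() is the number of code points.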

Upvotes: 1

Sam Varshavchik

Reputation: 118415

Your C++ code is simply echoing octets to your terminal, and it is your terminal that converts those octets, encoded in its default character set, into Unicode characters.

Judging by your example, your terminal uses UTF-8. The rules for decoding UTF-8 are well specified (Google is your friend), so all you have to do is check the first octet of a UTF-8 sequence to figure out how many octets make up the next Unicode character.
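
As a rough sketch of that idea (not code from this answer), the high bits of the first octet give the sequence length: 0xxxxxxx is one octet, 110xxxxx two, 1110xxxx three, 11110xxx four. The function names utf8_sequence_length and typewriter_prefixes are made up for illustration, the input is assumed to be mostly valid UTF-8, and invalid octets are stepped over one at a time rather than reported.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of octets in the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and also for unexpected octets (a stray
// continuation octet or an invalid lead), so callers always advance.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // not a valid lead octet
}

// Builds the typewriter steps: every prefix that ends on a
// character boundary, from the first character to the whole string.
std::vector<std::string> typewriter_prefixes(const std::string& utf8)
{
    std::vector<std::string> prefixes;
    std::size_t pos = 0;
    while (pos < utf8.size())
    {
        pos += utf8_sequence_length(static_cast<unsigned char>(utf8[pos]));
        if (pos > utf8.size())
            pos = utf8.size(); // truncated sequence at the end of input
        prefixes.push_back(utf8.substr(0, pos));
    }
    return prefixes;
}
```

Printing each element of typewriter_prefixes(st1) with printf("%s\n", p.c_str()) gives one line per character, without ever cutting 》 or 《 in the middle.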

Upvotes: 0
