Kamchatka

Reputation: 3675

Convert from UTF-8 to ISO8859-15 in C++

I would like to do a conversion from UTF-8 to ISO 8859-15 in C/C++, without including an additional library.

How can I achieve this?

I found the following piece of code, which works for ISO 8859-1, but I'm not sure how to handle the differences between ISO 8859-15 and ISO 8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-15):

std::string UTF8toISO8859_1(const char * in) {
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            if (codepoint <= 255) {
                out.append(1, static_cast<char>(codepoint));
            }
            else {
                out.append("?");
            }
        }
    }
    return out;
}

Upvotes: 1

Views: 1542

Answers (1)

Codo

Reputation: 78795

I like this code. It's surprisingly short. Most of the code just deals with decoding multi-byte sequences into codepoints. Once a codepoint has been decoded, the conversion to ISO-8859-1 is very simple:

  • If it's less than or equal to 255, it's also a valid ISO-8859-1 character: out.append(1, static_cast<char>(codepoint));
  • If not, it cannot be represented in ISO-8859-1 and is replaced with a question mark: out.append("?");

So to make it work for ISO-8859-15, more code is needed to handle the eight characters that were replaced when ISO-8859-15 was introduced (see Comparing ISO-8859-1 and ISO-8859-15). Unfortunately, this considerably increases the code size.

The code below is written to be easy to understand. It can be optimized for better performance if that's a main concern.

std::string UTF8toISO8859_15(const char * in) {
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0; // initialized in case the input starts with a continuation byte
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;

        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            // a valid codepoint has been decoded; convert it to ISO-8859-15
            char outc;
            if (codepoint <= 255) {
                // codepoints up to 255 can be directly converted with a few exceptions
                if (codepoint != 0xa4 && codepoint != 0xa6 && codepoint != 0xa8
                        && codepoint != 0xb4 && codepoint != 0xb8 && codepoint != 0xbc
                        && codepoint != 0xbd && codepoint != 0xbe) {
                    outc = static_cast<char>(codepoint);
                }
                else {
                    outc = '?';
                }
            }
            else {
                // With a few exceptions, codepoints above 255 cannot be converted
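                if (codepoint == 0x20AC) {          // euro sign
                    outc = static_cast<char>(0xa4);
                }
                else if (codepoint == 0x0160) {     // S with caron
                    outc = static_cast<char>(0xa6);
                }
                else if (codepoint == 0x0161) {     // s with caron
                    outc = static_cast<char>(0xa8);
                }
                else if (codepoint == 0x017d) {     // Z with caron
                    outc = static_cast<char>(0xb4);
                }
                else if (codepoint == 0x017e) {     // z with caron
                    outc = static_cast<char>(0xb8);
                }
                else if (codepoint == 0x0152) {     // OE ligature
                    outc = static_cast<char>(0xbc);
                }
                else if (codepoint == 0x0153) {     // oe ligature
                    outc = static_cast<char>(0xbd);
                }
                else if (codepoint == 0x0178) {     // Y with diaeresis
                    outc = static_cast<char>(0xbe);
                }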
                else {
                    outc = '?';
                }
            }
            out.append(1, outc);
        }
    }
    return out;
}
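
If the long if/else chain bothers you, the eight exceptions could also be kept in a small table. The following is only a sketch of that idea, not a tested drop-in replacement: the helper name codepointToISO8859_15 and the Mapping struct are made up for this illustration, and it assumes the codepoint has already been decoded by the loop above.

```cpp
// Hypothetical helper (name invented for this sketch): maps one decoded
// Unicode codepoint to its ISO-8859-15 byte, or '?' if it has none.
char codepointToISO8859_15(unsigned int codepoint) {
    // The eight positions where ISO-8859-15 differs from ISO-8859-1.
    struct Mapping { unsigned int codepoint; unsigned char byte; };
    static const Mapping replaced[] = {
        {0x20AC, 0xA4}, // euro sign
        {0x0160, 0xA6}, // S with caron
        {0x0161, 0xA8}, // s with caron
        {0x017D, 0xB4}, // Z with caron
        {0x017E, 0xB8}, // z with caron
        {0x0152, 0xBC}, // OE ligature
        {0x0153, 0xBD}, // oe ligature
        {0x0178, 0xBE}, // Y with diaeresis
    };
    const unsigned int count = sizeof(replaced) / sizeof(replaced[0]);

    // Codepoints that moved into one of the eight replaced positions.
    for (unsigned int i = 0; i < count; ++i) {
        if (codepoint == replaced[i].codepoint)
            return static_cast<char>(replaced[i].byte);
    }

    // Everything else above 255 has no ISO-8859-15 representation.
    if (codepoint > 0xFF)
        return '?';

    // Latin-1 characters whose byte position now holds another character.
    for (unsigned int i = 0; i < count; ++i) {
        if (codepoint == replaced[i].byte)
            return '?';
    }

    // All remaining codepoints up to 255 map to the same byte value.
    return static_cast<char>(codepoint);
}
```

With such a helper, the whole conversion block inside the loop would shrink to out.append(1, codepointToISO8859_15(codepoint)); while the decoding part stays unchanged.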

Upvotes: 2
