user230910

Reputation: 2372

Simplify a C++ expression in a string encoding answer

In this question: Convert ISO-8859-1 strings to UTF-8 in C/C++

There is a really nice, concise piece of C++ code that converts ISO-8859-1 strings to UTF-8.

In this answer: https://stackoverflow.com/a/4059934/3426514

I'm still a beginner at C++ and I'm struggling to understand how this works. I have read up on the encoding sequences of UTF-8, and I understand that values below 128 stay the same, while for 128 and above the first byte gets a prefix and the rest of the bits are spread over a couple of bytes that each start with 10xx. But I see no bit shifting in this answer.

If someone could help me to decompose it into a function that only processes 1 character, it would really help me understand.

Upvotes: 2

Views: 209

Answers (1)

DevSolar

Reputation: 70263

Code, commented.

This works because Latin-1 0x00 through 0xff map to consecutive UTF-8 code sequences: 0x00-0x7f, 0xc2 0x80-0xbf, 0xc3 0x80-0xbf.

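// converting one byte (Latin-1 character) of input per iteration
// (assumes in and out are unsigned char *; with a signed char type,
// the comparisons against 0x80 and 0xbf would not work as intended)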
while (*in)
{
    if ( *in < 0x80 )
    {
        // just copy
        *out++ = *in++;
    }
    else
    {
        // first byte is 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
        // (the condition in () evaluates to true / 1)
        *out++ = 0xc2 + ( *in > 0xbf );

        // second byte is the lower six bits of the input byte
        // with the highest bit set (and, implicitly, the second-
        // highest bit unset)
        *out++ = ( *in++ & 0x3f ) + 0x80;
    }
}

The problem with a function processing a single (input) character is that the output could be either one or two bytes, making the function a bit awkward to use. You are usually better off (both in performance and cleanliness of code) with processing whole strings.
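That said, since the question asks for it, here is a minimal sketch of a per-character version that spells out the bit shifting the compact loop hides. The names append_latin1_as_utf8 and latin1_to_utf8 are hypothetical, not from the linked answer, and appending to a std::string is just one way around the "one or two bytes" awkwardness.

```cpp
#include <string>

// Hypothetical helper (not from the linked answer): appends the UTF-8
// encoding of a single Latin-1 byte to `out`.
void append_latin1_as_utf8( unsigned char c, std::string & out )
{
    if ( c < 0x80 )
    {
        // ASCII range: identical in Latin-1 and UTF-8, just copy
        out += static_cast< char >( c );
    }
    else
    {
        // lead byte 110xxxxx carrying the top two bits of c;
        // c >> 6 is 2 for 0x80-0xbf and 3 for 0xc0-0xff,
        // so this yields 0xc2 or 0xc3
        out += static_cast< char >( 0xc0 | ( c >> 6 ) );

        // continuation byte 10xxxxxx carrying the lower six bits of c
        out += static_cast< char >( 0x80 | ( c & 0x3f ) );
    }
}

// Hypothetical wrapper: converts a whole Latin-1 string by calling the
// per-character helper on every byte.
std::string latin1_to_utf8( const std::string & in )
{
    std::string out;
    out.reserve( in.size() * 2 );   // worst case: every byte becomes two

    for ( unsigned char c : in )
    {
        append_latin1_as_utf8( c, out );
    }

    return out;
}
```

For example, the Latin-1 byte 0xe9 (é) has top two bits 11, so the lead byte is 0xc3, and lower six bits 101001, so the continuation byte is 0xa9. The expression 0xc2 + ( *in > 0xbf ) in the loop computes the same lead byte without a shift, because only two lead bytes are ever possible for Latin-1 input.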

Note that the assumption of Latin-1 as input encoding is very likely to be wrong. For example, Latin-1 doesn't have the Euro sign (€), or any of these characters ŠšŽžŒœŸ, which makes most people in Europe use either Latin-9 or CP-1252, even if they are not aware of it. ("Encoding? No idea. Latin-1? Yea, that sounds about right.")
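To illustrate why the wrong assumption matters, here is a rough sketch of a general UTF-8 encoder for code points below U+10000. The name encode_utf8 is hypothetical, and the sketch does no validation (surrogates, for instance, are not rejected). The Euro sign is U+20AC, which CP-1252 stores as byte 0x80 and Latin-9 as byte 0xa4; it needs three bytes in UTF-8, so the two-byte shortcut above can never produce it.

```cpp
#include <string>

// Hypothetical helper: encodes one code point below U+10000 as UTF-8.
// No validation is done; this is an illustration only.
std::string encode_utf8( char32_t cp )
{
    std::string out;

    if ( cp < 0x80 )
    {
        // one byte: 0xxxxxxx
        out += static_cast< char >( cp );
    }
    else if ( cp < 0x800 )
    {
        // two bytes: 110xxxxx 10xxxxxx
        out += static_cast< char >( 0xc0 | ( cp >> 6 ) );
        out += static_cast< char >( 0x80 | ( cp & 0x3f ) );
    }
    else
    {
        // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast< char >( 0xe0 | ( cp >> 12 ) );
        out += static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3f ) );
        out += static_cast< char >( 0x80 | ( cp & 0x3f ) );
    }

    return out;
}
```

Treating a CP-1252 byte 0x80 as Latin-1 amounts to encode_utf8( 0x80 ), which gives 0xc2 0x80 (a control character), whereas the intended Euro sign is encode_utf8( 0x20ac ), which gives 0xe2 0x82 0xac.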

All that being said, that's the C way to do it. The C++ way would (probably, hopefully) look more like this:

#include <unicode/unistr.h>
#include <unicode/bytestream.h>

// ...

icu::UnicodeString ustr( in, "ISO-8859-1" );

// ...work with a properly Unicode-aware string class...

// ...convert to UTF-8 if necessary.
char buffer[ BUFSIZE ];
icu::CheckedArrayByteSink bs( buffer, BUFSIZE );
ustr.toUTF8( bs );

That uses the International Components for Unicode (ICU) library. Note how easily this can be adapted to a different input encoding. Different output encodings, iostream operators, character iterators, and even a C API are readily available from the library.

Upvotes: 1
