Reputation: 2372
In this question: Convert ISO-8859-1 strings to UTF-8 in C/C++
There is a really nice concise piece of c++ code that converts ISO-8859-1 strings to UTF-8.
In this answer: https://stackoverflow.com/a/4059934/3426514
I'm still a beginner at c++ and I'm struggling to understand how this works. I have read up on the encoding sequences of UTF-8, and I understand that <128 the chars are the same, and above 128 the first byte gets a prefix and the rest of the bits are spread over a couple of bytes starting with 10xx, but I see no bit shifting in this answer.
If someone could help me to decompose it into a function that only processes 1 character, it would really help me understand.
Upvotes: 2
Views: 209
Reputation: 70263
Code, commented.
This works on the fact that Latin-1 0x00 through 0xff are mapping to consecutive UTF-8 code sequences 0x00-0x7f, 0xc2 0x80-bf, 0xc3 0x80-bf.
// converting one byte (latin-1 character) of input
while (*in)
{
if ( *in < 0x80 )
{
// just copy
*out++ = *in++;
}
else
{
// first byte is 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
// (the condition in () evaluates to true / 1)
*out++ = 0xc2 + ( *in > 0xbf ),
// second byte is the lower six bits of the input byte
// with the highest bit set (and, implicitly, the second-
// highest bit unset)
*out++ = ( *in++ & 0x3f ) + 0x80;
}
}
The problem with a function processing a single (input) character is that the output could be either one or two bytes, making the function a bit awkward to use. You are usually better off (both in performance and cleanliness of code) with processing whole strings.
Note that the assumption of Latin-1 as input encoding is very likely to be wrong. For example, Latin-1 doesn't have the Euro sign (€
), or any of these characters ŠšŽžŒœŸ
, which makes most people in Europe use either Latin-9 or CP-1252, even if they are not aware of it. ("Encoding? No idea. Latin-1? Yea, that sounds about right.")
All that being said, that's the C way to do it. The C++ way would (probably, hopefully) look more like this:
#include <unistr.h>
#include <bytestream.h>
// ...
icu::UnicodeString ustr( in, "ISO-8859-1" );
// ...work with a properly Unicode-aware string class...
// ...convert to UTF-8 if necessary.
char * buffer[ BUFSIZE ];
icu::CheckedArrayByteSink bs( buffer, BUFSIZE );
ustr.toUTF8( bs );
That is using the International Components for Unicode (ICU) library. Note the ease this is adopted to a different input encoding. Different output encodings, iostream operators, character iterators, and even a C API are readily available from the library.
Upvotes: 1