gordonwd
gordonwd

Reputation: 4587

Convert ISO-8859-1 strings to UTF-8 in C/C++

You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 coding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.

I found one commercial product, but it's beyond my budget at this time.

Upvotes: 26

Views: 51949

Answers (7)

o0Evolved0o
o0Evolved0o

Reputation: 11

The code

isolat1ToUTF8(unsigned char* out, int *outlen,
              const unsigned char* in, int *inlen) {
    unsigned char* outstart = out;
    const unsigned char* base = in;
    const unsigned char* processed = in;
    unsigned char* outend = out + *outlen;
    const unsigned char* inend;
    unsigned int c;
    int bits;

    inend = in + (*inlen);
    while ((in < inend) && (out - outstart + 5 < *outlen)) {
    c= *in++;

    /* assertion: c is a single UTF-4 value */
        if (out >= outend)
        break;
        if      (c <    0x80) {  *out++=  c;                bits= -6; }
        else                  {  *out++= ((c >>  6) & 0x1F) | 0xC0;  bits=  0; }
 
        for ( ; bits >= 0; bits-= 6) {
            if (out >= outend)
            break;
            *out++= ((c >> bits) & 0x3F) | 0x80;
        }
    processed = (const unsigned char*) in;
    }
    *outlen = out - outstart;
    *inlen = processed - base;
    return(0);
}

I think this could be helpfull! And sorry for my last comment what was deleted! I can give you the link if needed there is a full explanation in a .c file. I have got this from it. Cheers!

Upvotes: -1

Spacemoose
Spacemoose

Reputation: 4016

You can use the boost::locale library:

http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html

The code would look like this:

#include <boost/locale.hpp>
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string,"Latin1");

Upvotes: 6

cytrinox
cytrinox

Reputation: 1946

The C++03 standard does not provide functions to directly convert between specific charsets.

Depending on your OS, you can use iconv() on Linux, MultiByteToWideChar() & Co. on Windows. A library which provides large support for string conversion is the ICU library which is open source.

Upvotes: 3

Lord Raiden
Lord Raiden

Reputation: 331

To c++ i use this:

std::string iso_8859_1_to_utf8(std::string &str)
{
    string strOut;
    for (std::string::iterator it = str.begin(); it != str.end(); ++it)
    {
        uint8_t ch = *it;
        if (ch < 0x80) {
            strOut.push_back(ch);
        }
        else {
            strOut.push_back(0xc0 | ch >> 6);
            strOut.push_back(0x80 | (ch & 0x3f));
        }
    }
    return strOut;
}

Upvotes: 21

RBerteig
RBerteig

Reputation: 43326

The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

It would not be difficult to parse that table directly and form a lookup table from it at compile time.

Upvotes: 2

R.. GitHub STOP HELPING ICE
R.. GitHub STOP HELPING ICE

Reputation: 215259

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

unsigned char *in, *out;
while (*in)
    if (*in<128) *out++=*in++;
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;

For safety you need to ensure that the output buffer is twice as large as the input buffer, or else include a size limit and check it in the loop condition.

Upvotes: 41

Cheers and hth. - Alf
Cheers and hth. - Alf

Reputation: 145269

ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.

The C++ aspects -- integrating that with iostreams -- are much harder.

I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.

Cheers & hth.,

Upvotes: -2

Related Questions