Reputation: 4587
You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 encoding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but I need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.
I found one commercial product, but it's beyond my budget at this time.
Upvotes: 26
Views: 51949
Reputation: 11
int isolat1ToUTF8(unsigned char* out, int *outlen,
const unsigned char* in, int *inlen) {
unsigned char* outstart = out;
const unsigned char* base = in;
const unsigned char* processed = in;
unsigned char* outend = out + *outlen;
const unsigned char* inend;
unsigned int c;
int bits;
inend = in + (*inlen);
while ((in < inend) && (out - outstart + 5 < *outlen)) {
c= *in++;
/* assertion: c is a single UCS-4 value */
if (out >= outend)
break;
if (c < 0x80) { *out++= c; bits= -6; }
else { *out++= ((c >> 6) & 0x1F) | 0xC0; bits= 0; }
for ( ; bits >= 0; bits-= 6) {
if (out >= outend)
break;
*out++= ((c >> bits) & 0x3F) | 0x80;
}
processed = (const unsigned char*) in;
}
*outlen = out - outstart;
*inlen = processed - base;
return(0);
}
I think this could be helpful. Sorry about my previous comment, which was deleted. I got this from a .c file that contains a full explanation, and I can give you the link if needed. Cheers!
Upvotes: -1
Reputation: 4016
You can use the boost::locale library:
http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html
The code would look like this:
#include <boost/locale.hpp>
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string,"Latin1");
Upvotes: 6
Reputation: 1946
The C++03 standard does not provide functions to directly convert between specific charsets.
Depending on your OS, you can use iconv() on Linux or MultiByteToWideChar() & Co. on Windows. A library that provides extensive support for string conversion is ICU, which is open source.
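As a rough illustration of the iconv() route, here is a minimal sketch. The helper name latin1_to_utf8_iconv is made up for this example, and it assumes a glibc-style iconv() whose input parameter is declared char** (some older platforms declare it const char**, and some need -liconv at link time).

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from this thread): converts ISO-8859-1 bytes
// to UTF-8 with POSIX iconv().
std::string latin1_to_utf8_iconv(const std::string& in) {
    // Note the argument order: target encoding first, source encoding second.
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }

    // Each ISO-8859-1 byte becomes at most two UTF-8 bytes.
    std::string out(in.size() * 2, '\0');

    // iconv() takes non-const pointers but does not modify the input bytes.
    char* src = const_cast<char*>(in.data());
    std::size_t src_left = in.size();
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) {
        throw std::runtime_error("iconv failed");
    }

    // Trim the unused tail of the output buffer.
    out.resize(out.size() - dst_left);
    return out;
}
```

Calling latin1_to_utf8_iconv("caf\xE9") should give back the bytes 63 61 66 C3 A9.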
Upvotes: 3
Reputation: 331
For C++ I use this:
#include <cstdint>
#include <string>

std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        std::uint8_t ch = *it;
        if (ch < 0x80) {
            // ASCII range: copy the byte unchanged
            strOut.push_back(ch);
        }
        else {
            // 0x80..0xFF: two bytes, 110000xx 10xxxxxx
            strOut.push_back(0xc0 | (ch >> 6));
            strOut.push_back(0x80 | (ch & 0x3f));
        }
    }
    return strOut;
}
Upvotes: 21
Reputation: 43326
The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.
It would not be difficult to parse that table directly and form a lookup table from it at compile time.
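To make the lookup-table idea concrete, here is a sketch that hard-codes only the 32 entries (0x80 to 0x9F) where CP1252 differs from ISO-8859-1, rather than parsing the table file. The names kCp1252High and cp1252_to_utf8 are invented for this example, and mapping the five bytes that CP1252 leaves undefined to U+FFFD is my assumption, not something the table prescribes.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for CP1252 bytes 0x80..0x9F.
// Assumption: the five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
// are mapped to U+FFFD (replacement character).
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
};

// Hypothetical helper: converts CP1252 bytes to UTF-8.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // worst case: three UTF-8 bytes per input byte
    for (unsigned char ch : in) {
        // Outside 0x80..0x9F the byte value is already the code point.
        std::uint32_t cp = ch;
        if (ch >= 0x80 && ch <= 0x9F) {
            cp = kCp1252High[ch - 0x80];
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // All table entries are in the BMP, so three bytes suffice.
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

With this table, the CP1252 euro sign (byte 0x80) comes out as E2 82 AC, while bytes from 0xA0 upward are encoded exactly as they would be for ISO-8859-1.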
Upvotes: 2
Reputation: 215259
If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:
/* in: NUL-terminated ISO-8859-1 input; out: output buffer (both must be set up by the caller) */
unsigned char *in, *out;
while (*in)
    if (*in<128) *out++=*in++;
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;
For safety you need to ensure that the output buffer is twice as large as the input buffer, or else include a size limit and check it in the loop condition.
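For the second option, a size-limited variant might look like the sketch below. The function name, the choice to NUL-terminate the output, and the (size_t)-1 error return are all assumptions made for this example.

```cpp
#include <cstddef>

// Hypothetical bounded variant of the loop above.
// Converts NUL-terminated ISO-8859-1 input to NUL-terminated UTF-8.
// Returns the number of bytes written (not counting the NUL), or
// (size_t)-1 if the output buffer is too small.
std::size_t latin1_to_utf8_bounded(const unsigned char* in,
                                   unsigned char* out,
                                   std::size_t out_size) {
    std::size_t n = 0;
    for (; *in; ++in) {
        std::size_t need = (*in < 0x80) ? 1 : 2;
        // Always keep one byte free for the terminating NUL.
        if (n + need >= out_size) {
            return static_cast<std::size_t>(-1);
        }
        if (*in < 0x80) {
            out[n++] = *in;
        } else {
            // Lead byte is 0xC2 for 0x80..0xBF and 0xC3 for 0xC0..0xFF.
            out[n++] = static_cast<unsigned char>(0xC2 + (*in > 0xBF));
            out[n++] = static_cast<unsigned char>((*in & 0x3F) + 0x80);
        }
    }
    if (n >= out_size) {
        return static_cast<std::size_t>(-1);  // no room for the NUL
    }
    out[n] = 0;
    return n;
}
```

Sizing the output buffer at twice the input length plus one guarantees the error path is never taken.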
Upvotes: 41
Reputation: 145269
ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.
The C++ aspects -- integrating that with iostreams -- are much harder.
I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.
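A sketch of such a string-to-string converter, built on the general UTF-8 encoding algorithm, could look like this. The names append_utf8 and latin1_to_utf8 are invented for this example, and append_utf8 assumes it is only given valid code points (at most 0x10FFFF, no surrogates).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point.
// Assumes cp is a valid Unicode scalar value; no validation is done.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// ISO-8859-1 byte values are the code points themselves, so the
// conversion is just "encode every byte as a code point".
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char ch : in) {
        append_utf8(out, ch);
    }
    return out;
}
```

Only the first two branches are ever reached for ISO-8859-1 input; the other two are there so the same helper works for code points from other sources.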
Cheers & hth.,
Upvotes: -2