Reputation: 204766
In my C++ program I want to convert a std::string like this:
abc €
to a UTF-8 escape sequence:
abc%20%E2%82%AC
And I need it to be platform-independent! All I have found are solutions that only work on Windows. There must be a portable solution out there, right?
Upvotes: 1
Views: 4451
Reputation: 279255
Prior to C++11, there's no mandated support for UTF-8 in the standard.
There are two steps here:
1. Convert the string from its input encoding to UTF-8.
2. URL-escape the resulting UTF-8 bytes.
Neither of them is particularly difficult to write portably yourself, assuming you know what character encoding the input string uses[*]. That also means other people have done it before, so you shouldn't need to write it yourself. If you search for the steps separately, you might have better luck finding platform-independent code for each one.
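As a rough illustration of the encoding part, here is a minimal sketch of turning one Unicode code point into UTF-8 bytes. The name appendUtf8 is hypothetical and not from any library, and the sketch assumes you already have code points in hand (getting them from the input string depends on its encoding). It uses std::uint32_t from <cstdint>, so it assumes a C++11 compiler.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not a library function): appends the UTF-8
// encoding of the code point cp to out.
// Returns false for surrogates and for values above U+10FFFF.
bool appendUtf8(std::uint32_t cp, std::string& out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

For the Euro sign, appendUtf8(0x20AC, s) appends the three bytes 0xE2 0x82 0xAC, matching the bytes in the question's example.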
Note there are two different ways to URL-escape a space character, either as + or as %20. Your example uses %20, so if that's important to you then don't accidentally use a URL-escape routine that does the other.
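To make the difference concrete, here is a small sketch that escapes only the spaces in a string in either style. The function name escapeSpaces and its formStyle flag are made up for illustration. As far as I know, + is the convention for HTML form data (application/x-www-form-urlencoded), while %20 is plain percent-encoding.

```cpp
#include <string>

// Hypothetical helper: escapes only the space characters of in.
// formStyle == true  -> spaces become '+'
// formStyle == false -> spaces become "%20"
std::string escapeSpaces(const std::string& in, bool formStyle)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == ' ') {
            out += formStyle ? "+" : "%20";
        } else {
            out += in[i];
        }
    }
    return out;
}
```

A routine that produces + will not reproduce the abc%20... form shown in the question, so check which style a library uses before adopting it.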
[*] It's not ISO-Latin-1, since that doesn't have the Euro sign[**], but it might be Windows CP-1252.
[**] Unless it's been added recently. Anyway, your example codes the Euro sign as UTF-8 bytes 0xE2 0x82 0xAC, which represent the Unicode code point 0x20AC, not the byte value 0x80 that it has in CP1252. So if it was originally a single-byte encoding, then clearly an intelligent single-byte-to-Unicode-code-point conversion has been applied along the way. You could say there are three steps:
1. std::string to Unicode code points (depends on input encoding).
2. Unicode code points to UTF-8.
3. URL-escape the UTF-8 bytes.
Upvotes: 3
Reputation: 201
For platform-independent, feature-rich Unicode handling, the "de facto" standard library is ICU, which is used by many Fortune 500 companies and open-source projects. Its license is open-source and friendly to commercial development.
It could be overkill if you just want to use some simple conversions though...
If you just need a simple portable UTF-8 C++ library, you can try http://utfcpp.sourceforge.net
hth
Upvotes: 2
Reputation: 153929
It seems rather straightforward to me. Your string is a sequence of bytes. Certain byte values (most, actually, but not the most common) are not permitted, and should be replaced with the three character sequence '%' followed by two hex characters representing the byte value. So something like:
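#include <string>

std::string
toEscaped( std::string const& original )
{
    // allowed[b] is true if byte value b may appear unescaped.
    // Until the table is filled in, every entry is false, so
    // every byte gets escaped.
    static bool const allowed[256] =
    {
        // Define the 256 entries...
    };
    static char const hexChars[] = "0123456789ABCDEF";
    std::string results;
    for ( std::string::const_iterator iter = original.begin();
            iter != original.end();
            ++ iter ) {
        // Work on the unsigned byte value; plain char may be signed.
        unsigned char const byte = static_cast<unsigned char>( *iter );
        if ( allowed[byte] ) {
            results += *iter;
        } else {
            results += '%';
            results += hexChars[byte >> 4];
            results += hexChars[byte & 0x0F];
        }
    }
    return results;
}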
should do the trick.
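For a version that runs without filling in the table by hand, here is a sketch that replaces the lookup table with a small predicate. It assumes the input is already UTF-8 and that only the "unreserved" characters of RFC 3986 (letters, digits, '-', '.', '_', '~') should be left alone; that choice is my assumption, not something stated in the question. The names isUnreserved and percentEncode are hypothetical, and the range checks assume an ASCII-compatible execution character set.

```cpp
#include <string>

// Assumption: only RFC 3986 "unreserved" characters stay unescaped.
// The range comparisons assume an ASCII-compatible character set.
static bool isUnreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical name: percent-encodes every byte that is not unreserved.
std::string percentEncode(const std::string& original)
{
    static const char hexChars[] = "0123456789ABCDEF";
    std::string result;
    for (std::string::size_type i = 0; i < original.size(); ++i) {
        // Convert to unsigned so bytes >= 0x80 index correctly.
        const unsigned char byte = static_cast<unsigned char>(original[i]);
        if (isUnreserved(byte)) {
            result += static_cast<char>(byte);
        } else {
            result += '%';
            result += hexChars[byte >> 4];    // high nibble
            result += hexChars[byte & 0x0F];  // low nibble
        }
    }
    return result;
}
```

With the UTF-8 input "abc \xE2\x82\xAC" (that is, "abc €"), this returns "abc%20%E2%82%AC", the form asked for in the question. If the input is in some other encoding, it has to be converted to UTF-8 first.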
Upvotes: 4