juergen d

Reputation: 204766

Convert string to UTF-8 escape sequence

In my C++ program I want to convert a std::string like this:

abc €

to a UTF-8 escape sequence:

abc%20%E2%82%AC

And I need it to be platform independent! All the solutions I have found work only on Windows. There must be a portable solution out there, right?

Upvotes: 1

Views: 4451

Answers (3)

Steve Jessop

Reputation: 279255

Prior to C++11, there was no mandated support for UTF-8 in the standard.

There are two steps here:

  • convert to UTF-8 (unless it's already in UTF-8)
  • URL-escape the result (update: James Kanze covers this part)

Neither of them is particularly difficult to write for yourself portably, assuming you know what character encoding the input string uses[*]. That also means other people have done it before, so you shouldn't need to write it yourself. If you search for the two steps separately you might have better luck finding platform-independent code for each.

Note there are two different ways to URL-escape a space character, either as + or as %20. For example, "a b" becomes a%20b under plain percent-encoding and a+b under the form-encoding convention (application/x-www-form-urlencoded). Your example uses %20, so if that's important to you then don't accidentally use a URL-escape routine that does the other.

[*] It's not ISO-Latin-1, since that doesn't have the Euro sign[**], but it might be Windows CP-1252.

[**] ISO-8859-1 itself has not gained it; the later ISO-8859-15 (Latin-9) does include the Euro sign, at 0xA4. Anyway, your example codes the Euro sign as UTF-8 bytes 0xE2 0x82 0xAC, which represent the Unicode code point 0x20AC, not the value 0x80 that it has in CP1252. So if the input was originally in a single-byte encoding, then clearly an intelligent single-byte-to-Unicode-code-point conversion has been applied along the way. You could say there are three steps:

  • convert the std::string to Unicode code points (depends on input encoding)
  • convert the Unicode to UTF-8
  • URL-escape the UTF-8
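
The middle step is the most mechanical one. A minimal sketch, assuming the first step has already produced Unicode code points as 32-bit integers; the name encodeUtf8 is hypothetical and not taken from any library:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as 1 to 4 UTF-8 bytes.
// Precondition (checked only in debug builds): cp is a Unicode scalar
// value, i.e. at most 0x10FFFF and not a surrogate (0xD800..0xDFFF).
std::string encodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

    std::string out;
    if (cp < 0x80) {
        // 7 bits: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 11 bits: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, encodeUtf8(0x20AC) yields the three bytes 0xE2 0x82 0xAC from the question. The first step (input encoding to code points) still depends on what the input encoding actually is, so it is deliberately left out here.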

Upvotes: 3

jvaz

Reputation: 201

For platform-independent, feature-rich Unicode handling, the "de facto" standard library is ICU, which is used by many Fortune 500 companies and open-source projects. Its license is open source and friendly to commercial development.

It could be overkill if you just want to use some simple conversions though...

http://site.icu-project.org

If you just need a simple, portable UTF-8 library for C++, you can try http://utfcpp.sourceforge.net

hth

Upvotes: 2

James Kanze

Reputation: 153929

It seems rather straightforward to me. Your string is a sequence of bytes. Certain byte values (most, actually, but not the most common) are not permitted in a URL, and should be replaced with the three-character sequence '%' followed by two hex characters representing the byte value. So something like:

#include <string>

std::string
toEscaped( std::string const& original )
{
    //  allowed[b] is true if byte value b may appear unescaped.
    static bool const allowed[256] =
    {
        //  Define the 256 entries...
    };
    static char const hexChars[] = "0123456789ABCDEF";

    std::string results ;
    for ( std::string::const_iterator iter = original.begin();
            iter != original.end();
            ++ iter ) {
        //  Work on the unsigned value: plain char may be signed, and
        //  right-shifting a negative value is implementation-defined
        //  before C++20.
        unsigned char const byte = static_cast<unsigned char>(*iter);
        if ( allowed[byte] ) {
            results += *iter;
        } else {
            results += '%';
            results += hexChars[byte >> 4];
            results += hexChars[byte & 0x0F];
        }
    }
    return results;
}

should do the trick.
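
For a version that runs as-is, the table has to be filled in somehow. The sketch below replaces the 256-entry table with a predicate, and it assumes that "allowed" means the unreserved set of RFC 3986 (ASCII letters, digits, '-', '.', '_' and '~'). That choice is an assumption, not something stated above, and the names isUnreserved, hexDigit and percentEncode are hypothetical. It also assumes the input already holds UTF-8 bytes and that the execution character set is ASCII-compatible.

```cpp
#include <cassert>
#include <string>

// Assumption: only the RFC 3986 "unreserved" characters pass through.
static bool isUnreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
}

// Map a value in 0..15 to its uppercase hexadecimal digit.
static char hexDigit(unsigned value)
{
    assert(value < 16);
    return "0123456789ABCDEF"[value];
}

// Percent-encode every byte that is not unreserved.
// A space becomes %20 (not '+').
std::string percentEncode(std::string const& utf8Bytes)
{
    std::string result;
    for (std::string::size_type i = 0; i != utf8Bytes.size(); ++i) {
        // Convert to unsigned char first so bytes >= 0x80 index and
        // shift correctly even where plain char is signed.
        unsigned char const byte = static_cast<unsigned char>(utf8Bytes[i]);
        if (isUnreserved(byte)) {
            result += static_cast<char>(byte);
        } else {
            result += '%';
            result += hexDigit(byte >> 4);
            result += hexDigit(byte & 0x0F);
        }
    }
    return result;
}
```

Called on the UTF-8 bytes of the question's example, percentEncode("abc \xE2\x82\xAC") returns abc%20%E2%82%AC. If a different set of characters should pass through unescaped (for instance '/' in a path), only isUnreserved needs to change.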

Upvotes: 4
