Reputation: 119219
I understand that std::codecvt<char16_t, char, std::mbstate_t> in C++11 performs conversion between UTF-16 and UTF-8, and std::codecvt<char32_t, char, std::mbstate_t> performs conversion between UTF-32 and UTF-8. Is it possible to convert between, say, UTF-8 and ISO 8859-1?
Consider:
const char* s = "\u00C0";
If I print this string and my terminal's encoding is set to UTF-8, I will see the character À. If I set my terminal's encoding to ISO 8859-1, however, printing that string will not print out the desired character. How would I convert s into a string that, when printed, will show the character À if my terminal's encoding is set to ISO 8859-1?
I understand that this can be done with a library such as iconv, but I am curious whether it can be done using only the C++ standard library. I ask this question not because I don't want to use iconv, but because I don't really understand how locales work in C++.
Upvotes: 4
Views: 5427
Reputation: 145279
If you want to convert UTF-8 to ISO 8859-1 using only the facilities of the C++ standard library:
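ISO 8859-1 assigns its 256 byte values to the first 256 Unicode code points, so one possible approach is to decode the UTF-8 into UTF-32 and narrow each code point to a single byte. The following is a minimal sketch of that idea, not a definitive implementation: the helper name utf8_to_latin1 is made up for illustration, and std::wstring_convert and std::codecvt_utf8 are C++11 facilities that were deprecated in C++17.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the standard library): converts UTF-8 to
// ISO 8859-1 by decoding to UTF-32 and narrowing each code point to one byte.
// Throws std::range_error on malformed UTF-8 or on a code point above U+00FF.
std::string utf8_to_latin1(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> decoder;

    // from_bytes throws std::range_error if the input is not valid UTF-8.
    std::u32string code_points = decoder.from_bytes(utf8);

    std::string latin1;
    latin1.reserve(code_points.size());
    for (char32_t cp : code_points) {
        if (cp > 0xFF) {
            throw std::range_error("code point not representable in ISO 8859-1");
        }
        latin1.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return latin1;
}
```

Assuming the compiler stored s as UTF-8, printing utf8_to_latin1(s) should show À on a terminal whose encoding is set to ISO 8859-1.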
Since this has an answer, while almost any other desired specific encoding would not have an answer, I suspect that the question was constructed in order to be answerable.
The standard library conversions support only one other encoding, namely the unspecified multibyte encoding of the execution character set, via e.g. mbstowcs (as a matter of formal pedantry the wide character encoding need not be Unicode, so formally there is another unspecified encoding, but in practice it's Unicode, i.e. UTF-16 or UTF-32).
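As a rough illustration of that locale-dependent route, the sketch below wraps std::mbstowcs. The helper name widen is hypothetical, and which byte encoding it decodes depends entirely on the current C locale, so it is not a way to target one specific encoding portably.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical helper: decodes a multibyte string, interpreted in the current
// C locale's encoding, into wide characters. Returns an empty string if the
// input is not valid in that encoding.
std::wstring widen(const std::string& bytes) {
    // A multibyte string never yields more wide characters than it has bytes.
    std::wstring wide(bytes.size(), L'\0');
    std::size_t n = std::mbstowcs(&wide[0], bytes.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1)) {
        return std::wstring();  // invalid multibyte sequence
    }
    wide.resize(n);
    return wide;
}
```

A program starts in the "C" locale, so it would typically call std::setlocale(LC_ALL, "") first to pick up the encoding of the user's environment.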
I wondered if I should add a code example, but since there’s no interest in this answer (to the question’s “I am curious whether it can be done using only the C++ standard library”) I think it would be wasted effort.
Upvotes: 0
Reputation: 88155
In addition to the standard-mandated encodings, C++ also supports an implementation-defined list of encodings via locales:
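#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

// std::codecvt_byname has a protected destructor; deriving from it gives a
// facet type that std::wstring_convert is able to own and destroy.
template <typename Facet>
struct usable_facet : Facet {
    using Facet::Facet;  // inherit the base class constructors
};

using codecvt = usable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;

int main() {
    // Locale strings are platform specific; ".1252" is a Windows-style name.
    std::wstring_convert<codecvt> convert(new codecvt(".1252"));
    std::wstring w = convert.from_bytes("\u00C0");
}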
Unfortunately one of the things about wchar_t is that the standard mandates only that it use a fixed width encoding for all locales, but there's no requirement that it use the same encoding in different locales, and so you can't portably convert to wchar_t using one locale and then convert that back to char using a different locale.
There is potentially some portable support for such conversions using functions like std::mbrtoc32 and related functions, but these are not yet widely implemented.
"I understand that this can be done with a library such as iconv, but I am curious whether it can be done using only the C++ standard library. I ask this question not because I don't want to use iconv, but because I don't really understand how locales work in C++."
The locale library's design doesn't really lend itself to modern usage. C and C++ are themselves confused about encodings vs. character sets, and locales conflate lexical and orthographic issues with computational aspects such as encoding.
How locales work is a topic a bit broader than is suitable for a Stack Overflow answer, but there are books on the topic. You'd probably also need to read platform-specific materials, because the standard doesn't really give any context for much of the functionality. For example, the locale library supports message catalogues, but doesn't tell you what they are or how you'd actually make one, because that functionality is not standardized by C++.
Upvotes: 3