Reputation: 10962
I have some code that reads in a Unicode code point (escaped in a string, e.g. 0xF00).
Since I'm using Boost, I'm wondering whether the following is the best (and correct) approach:
unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);
?
Upvotes: 4
Views: 2782
Reputation: 74008
After reading about the unsteady state of UTF-8 support in C++, I stumbled upon the corresponding C function c32rtomb, which looks promising and likely won't be deprecated any time soon:
#include <clocale>
#include <cuchar>
#include <climits>
#include <string>
size_t to_utf8(char32_t codepoint, char *buf)
{
    // setlocale returns the name of the locale now in effect, so query and
    // save the previous one before switching to a UTF-8 locale
    std::string old_loc = std::setlocale(LC_ALL, nullptr);
    std::setlocale(LC_ALL, "en_US.utf8");
    std::mbstate_t state{};
    std::size_t len = std::c32rtomb(buf, codepoint, &state);
    std::setlocale(LC_ALL, old_loc.c_str()); // restore the previous locale
    return len;
}
Usage would then be
char32_t codepoint{0xfff};
char buf[MB_LEN_MAX]{};
size_t len = to_utf8(codepoint, buf);
If your application's current locale is already UTF-8, you can of course omit the back-and-forth setlocale calls. Note that c32rtomb returns (size_t)-1 on a conversion error.
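To make the error case explicit, here is a hedged sketch of a checked variant that maps the (size_t)-1 error return to an empty string. The name to_utf8_checked is made up for this example, and the locale names "C.UTF-8" / "en_US.utf8" are assumptions whose availability varies between systems:

```cpp
#include <clocale>
#include <cuchar>
#include <climits>
#include <string>

// Sketch: convert one code point to UTF-8, returning an empty string on failure.
std::string to_utf8_checked(char32_t codepoint)
{
    // setlocale(LC_ALL, nullptr) queries the current locale without changing it
    std::string old_loc = std::setlocale(LC_ALL, nullptr);
    if (!std::setlocale(LC_ALL, "C.UTF-8"))
        std::setlocale(LC_ALL, "en_US.utf8"); // fallback locale name (assumption)
    char buf[MB_LEN_MAX]{};
    std::mbstate_t state{};
    std::size_t len = std::c32rtomb(buf, codepoint, &state);
    std::setlocale(LC_ALL, old_loc.c_str()); // restore the previous locale
    if (len == static_cast<std::size_t>(-1))
        return {}; // encoding error (e.g. no UTF-8 locale available)
    return std::string(buf, len);
}
```

With a UTF-8 locale in effect, U+20AC (the euro sign) should come back as the three bytes E2 82 AC.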
Upvotes: 2
Reputation: 148870
C++17 has deprecated a number of convenience functions for processing UTF. Unfortunately, the last remaining ones are deprecated in C++20 (*). That being said, std::codecvt itself is still valid. From C++11 to C++17 you can use a std::codecvt<char32_t, char, mbstate_t>; starting with C++20 it will be std::codecvt<char32_t, char8_t, mbstate_t>.
Here is some code converting a code point (up to 0x10FFFF) to UTF-8:
// codepoint is the code point to convert
// buf is a char array of size sz (sz should be at least 4 to convert any code point)
// on return sz is the number of bytes of buf used for the UTF-8 converted string
// the return value is the return value of std::codecvt::out (0 for ok)
std::codecvt_base::result to_utf8(char32_t codepoint, char *buf, size_t& sz) {
    std::locale loc("");
    const std::codecvt<char32_t, char, std::mbstate_t> &cvt =
        std::use_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc);
    std::mbstate_t state{};
    const char32_t *last_in;
    char *last_out;
    std::codecvt_base::result res = cvt.out(state, &codepoint, &codepoint + 1, last_in,
                                            buf, buf + sz, last_out);
    sz = last_out - buf;
    return res;
}
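A self-contained usage sketch (the function is repeated here so the snippet compiles on its own; a default-constructed locale is used, since the standard requires this facet specialization to convert between UTF-32 and UTF-8 regardless of locale):

```cpp
#include <locale>
#include <cwchar>
#include <cstddef>

// Same to_utf8 as in the answer above, repeated so this snippet stands alone.
std::codecvt_base::result to_utf8(char32_t codepoint, char *buf, std::size_t& sz) {
    std::locale loc; // the char32_t/char codecvt facet is UTF-32 -> UTF-8 in any locale
    const std::codecvt<char32_t, char, std::mbstate_t> &cvt =
        std::use_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc);
    std::mbstate_t state{};
    const char32_t *last_in;
    char *last_out;
    std::codecvt_base::result res =
        cvt.out(state, &codepoint, &codepoint + 1, last_in, buf, buf + sz, last_out);
    sz = last_out - buf;
    return res;
}
```

Calling it with U+20AC (the euro sign) should yield std::codecvt_base::ok with sz set to 3 and buf holding the bytes E2 82 AC.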
(*) std::codecvt will still exist in C++20; it is just that the default instantiations will no longer be std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t> but std::codecvt<char16_t, char8_t, std::mbstate_t> and std::codecvt<char32_t, char8_t, std::mbstate_t> (note char8_t instead of char).
Upvotes: 4
Reputation: 385088
As mentioned, a codepoint in this form is (conveniently) UTF-32, so what you're looking for is a transcoding.
For a solution that does not rely on functions deprecated since C++17, isn't really ugly, and does not require hefty third-party libraries, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.
It's going to look something like this:
const uint32_t codepoint{0xF00};
std::vector<unsigned char> result;
try
{
    utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
}
catch (const utf8::invalid_code_point&)
{
    // something
}
(There's also a utf8::unchecked::utf32to8, if you're allergic to exceptions.)
(And consider reading into a vector<char8_t> or std::u8string, since C++20.)
(Finally, note that I've specifically used uint32_t to ensure the input has the proper width.)
I tend to use this library in projects until I need something a little heavier for other purposes (at which point I'll typically switch to ICU).
Upvotes: 5
Reputation: 40791
You can do this with the standard library using std::wstring_convert to convert UTF-32 (code points) to UTF-8:
#include <locale>
#include <codecvt>
std::string codepoint_to_utf8(char32_t codepoint) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(&codepoint, &codepoint + 1);
}
This returns a std::string whose size is 1, 2, 3, or 4 depending on how large codepoint is. It will throw a std::range_error if the code point is too large (> 0x10FFFF, the maximum Unicode code point).
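As a sketch of those size classes, the following repeats the answer's function so it compiles on its own (note that std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, though still available on current implementations):

```cpp
#include <locale>
#include <codecvt>
#include <string>

// The answer's function, repeated here so the size demonstration stands alone.
std::string codepoint_to_utf8(char32_t codepoint) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(&codepoint, &codepoint + 1);
}

// One sample code point at each UTF-8 length:
//   U+0024 '$' -> 1 byte
//   U+00A2 '¢' -> 2 bytes
//   U+20AC '€' -> 3 bytes
//   U+10348    -> 4 bytes
```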
Your version with Boost seems to be doing the same thing. The documentation says that the utf_to_utf function converts one UTF encoding to another, in this case 32 to 8. If you use char32_t, it will be a "correct" approach that also works on systems where unsigned int isn't the same size as char32_t.
// The function also converts the unsigned int to char32_t
std::string codepoint_to_utf8(char32_t codepoint) {
    return boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint + 1);
}
Upvotes: 7