Reputation: 17920
How to convert back and forth between a Unicode/UCS codepoint and a UTF16 surrogate pair in C++14 and later?
EDIT: Removed mention of UCS-2 surrogates, as there is no such thing. Thanks @remy-lebeau!
Upvotes: 0
Views: 1449
Reputation: 11
Or, in assembler, you could simply do it this way (with the code point, which must be 10000h or above, in the EAX register, and EDI pointing at the UTF-16 output buffer):
SUB EAX,10000h ;convert to 20-bit number in the range 0-0FFFFFh
MOV EDX,EAX ;keep number in edx
SHR EAX,10 ;move the top 10 bits into AX
ADD EAX,0D800h ;get the first surrogate in the range 0D800h to 0DBFFh
STOSW ;use that in destination
MOV EAX,EDX ;restore number from earlier
AND EAX,3FFh ;keep only the bottom 10 bits
ADD EAX,0DC00h ;get the second surrogate in the range 0DC00h to 0DFFFh
STOSW ;use that in destination
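For readers who prefer C++, here is a rough equivalent sketch of the assembly above (the function name is just for illustration; dst plays the role of the EDI destination pointer):
#include <cstdint>

// Equivalent of the assembly above: cp must be in the range 0x10000..0x10FFFF,
// and dst is advanced past the two code units that are written (like STOSW).
inline void write_surrogate_pair(std::uint32_t cp, std::uint16_t *&dst)
{
    cp -= 0x10000;                                              // 20-bit value
    *dst++ = static_cast<std::uint16_t>(0xD800 + (cp >> 10));   // high surrogate
    *dst++ = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
}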
Upvotes: 0
Reputation: 598134
In C++11 and later, you can use std::wstring_convert to convert between various UTF/UCS encodings, using the following std::codecvt types:
- UTF-8 <-> UCS-2: std::codecvt_utf8<char16_t>
- UTF-8 <-> UTF-16: std::codecvt_utf8_utf16
- UTF-8 <-> UTF-32/UCS-4: std::codecvt_utf8<char32_t>
- UCS-2 <-> UTF-16: std::codecvt_utf16<char16_t>
- UTF-16 <-> UTF-32/UCS-4: std::codecvt_utf16<char32_t>
- UCS-2 <-> UTF-32/UCS-4: no standard conversion, but you can write your own std::codecvt class for it if needed. Otherwise, use one of the above conversions in between: UCS-2 <-> UTF-X <-> UTF-32/UCS-4
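For instance, the chained UCS-2 <-> UTF-32 route could look like this (a sketch going through UTF-8 with the converters listed above; the function name is just for illustration):
#include <codecvt>
#include <locale>
#include <string>

std::u32string UCS2toUTF32(const std::u16string &ucs2)
{
    // UCS-2 -> UTF-8 -> UTF-32, chaining two of the standard converters
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> ucs2_to_utf8;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf8_to_utf32;
    return utf8_to_utf32.from_bytes(ucs2_to_utf8.to_bytes(ucs2));
}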
You don't need to handle surrogates manually. You can use std::u32string to hold your codepoint(s), and std::u16string to hold your UTF-16/UCS-2 code units.
For example:
#include <codecvt>
#include <locale>
#include <string>

// std::codecvt_utf16<char32_t> converts between UTF-32 (the "wide" side) and a
// UTF-16 *byte* stream (the "byte" side), so the wstring_convert element type
// must be char32_t. std::little_endian makes that byte stream match the layout
// of native char16_t on little-endian platforms (x86/x64, most ARM); drop the
// flag on a big-endian target.
using convert_utf16_utf32 = std::wstring_convert<
    std::codecvt_utf16<char32_t, 0x10FFFF, std::little_endian>, char32_t>;

std::u16string CodepointToUTF16(const char32_t codepoint)
{
    // UTF-32 code point -> UTF-16 byte stream -> char16_t code units
    const std::string bytes = convert_utf16_utf32{}.to_bytes(codepoint);
    return std::u16string(
        reinterpret_cast<const char16_t*>(bytes.data()),
        bytes.size() / sizeof(char16_t));
}

std::u16string UTF32toUTF16(const std::u32string &str)
{
    const std::string bytes = convert_utf16_utf32{}.to_bytes(str);
    return std::u16string(
        reinterpret_cast<const char16_t*>(bytes.data()),
        bytes.size() / sizeof(char16_t));
}

char32_t UTF16toCodepoint(const std::u16string &str)
{
    // char16_t code units viewed as a UTF-16 byte stream -> UTF-32
    const std::u32string utf32 = convert_utf16_utf32{}.from_bytes(
        reinterpret_cast<const char*>(str.data()),
        reinterpret_cast<const char*>(str.data() + str.size()));
    return utf32.empty() ? char32_t{} : utf32[0];
}

std::u32string UTF16toUTF32(const std::u16string &str)
{
    return convert_utf16_utf32{}.from_bytes(
        reinterpret_cast<const char*>(str.data()),
        reinterpret_cast<const char*>(str.data() + str.size()));
}
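A quick usage sketch for a round trip of U+1F600 with the functions above:
#include <cassert>

int main()
{
    const std::u16string utf16 = CodepointToUTF16(U'\U0001F600'); // U+1F600
    assert(utf16.size() == 2);
    assert(utf16[0] == char16_t(0xD83D)); // high surrogate
    assert(utf16[1] == char16_t(0xDE00)); // low surrogate
    assert(UTF16toCodepoint(utf16) == U'\U0001F600');
}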
Upvotes: 5
Reputation: 17920
The surrogate-pairs tag info page explains the algorithm for converting a code point to a surrogate pair (more clearly than the Unicode Standard 9.0 does in §3.9, Table 3-5) as follows:
Unicode characters outside the Basic Multilingual Plane, that is characters with code above 0xFFFF, are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme:
- 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
- the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
- the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
In C++14 and later this could be written as:
#include <cstdint>
using codepoint = std::uint32_t;
using utf16 = std::uint16_t;
struct surrogate {
    utf16 high; // Leading
    utf16 low;  // Trailing
};

constexpr surrogate split(codepoint const in) noexcept {
    auto const inMinus0x10000 = (in - 0x10000);
    surrogate const r{
        static_cast<utf16>((inMinus0x10000 / 0x400) + 0xd800),  // High
        static_cast<utf16>((inMinus0x10000 % 0x400) + 0xdc00)}; // Low
    return r;
}
In the reverse direction one just has to combine the last 10 bits from the high surrogate and the last 10 bits from the low surrogate, and add 0x10000:
constexpr codepoint combine(surrogate const s) noexcept {
    return static_cast<codepoint>(
        ((s.high - 0xd800) * 0x400) + (s.low - 0xdc00) + 0x10000);
}
Here's a test for these conversions:
#include <cassert>
constexpr bool isValidUtf16Surrogate(utf16 v) noexcept
{ return (v & 0xf800) == 0xd800; }

constexpr bool isValidCodePoint(codepoint v) noexcept {
    return (v <= 0x10ffff)
        && ((v >= 0x10000) || !isValidUtf16Surrogate(static_cast<utf16>(v)));
}

constexpr bool isValidUtf16HighSurrogate(utf16 v) noexcept
{ return (v & 0xfc00) == 0xd800; }

constexpr bool isValidUtf16LowSurrogate(utf16 v) noexcept
{ return (v & 0xfc00) == 0xdc00; }

constexpr bool codePointNeedsUtf16Surrogates(codepoint v) noexcept
{ return (v >= 0x10000) && (v <= 0x10ffff); }

void test(codepoint const in) {
    assert(isValidCodePoint(in));
    assert(codePointNeedsUtf16Surrogates(in));
    auto const s = split(in);
    assert(isValidUtf16HighSurrogate(s.high));
    assert(isValidUtf16LowSurrogate(s.low));
    auto const out = combine(s);
    assert(isValidCodePoint(out));
    assert(in == out);
}

int main() {
    for (codepoint c = 0x10000; c <= 0x10ffff; ++c)
        test(c);
}
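As a quick spot check, the pair for U+1F600 can also be verified at compile time with the functions above:
static_assert(split(0x1F600).high == 0xD83D, "U+1F600 high surrogate");
static_assert(split(0x1F600).low == 0xDE00, "U+1F600 low surrogate");
static_assert(combine({0xD83D, 0xDE00}) == 0x1F600, "U+1F600 round trip");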
Upvotes: 8