Reputation: 8033
I have encountered an interesting issue on Windows 8. I tested whether I can represent Unicode characters outside the BMP with wchar_t* strings. The following test code produced unexpected results for me:
const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character
int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.
int i2 = sizeof(s1); // i2 == 4, because of the terminating '\0' (I guess).
int i3 = sizeof(s2); // i3 == 4, why?
U+2008A is a Han character, which is outside the Basic Multilingual Plane, so it should be represented by a surrogate pair in UTF-16. Which means - if I understand it correctly - that it should be represented by two wchar_t characters. So I expected sizeof(s2) to be 6 (4 for the two wchar_t-s of the surrogate pair and 2 for the terminating \0).
So why is sizeof(s2) == 4? I verified that the s2 string was constructed correctly by rendering it with DirectWrite, and the Han character was displayed correctly.
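For reference, a minimal sketch of the surrogate-pair arithmetic for U+2008A (the variable names are just illustrative; the constants come from the UTF-16 encoding rules):

#include <cstdio>
#include <cstdint>

int main()
{
    uint32_t cp = 0x2008A;            // code point above U+FFFF
    uint32_t v  = cp - 0x10000;       // 0x1008A
    unsigned hi = 0xD800 + (v >> 10); // 0xD840, high (lead) surrogate
    unsigned lo = 0xDC00 + (v & 0x3FF); // 0xDC8A, low (trail) surrogate
    printf("%04X %04X\n", hi, lo);    // prints "D840 DC8A"
}

So s2 should indeed hold two wchar_t code units plus the terminating L'\0', i.e. 6 bytes of character data on Windows.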
UPDATE: As Naveen pointed out, I was trying to determine the size of the arrays incorrectly. The following code produces the correct result:
const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character
int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.
std::wstring str1 (s1);
std::wstring str2 (s2);
int i2 = str1.size(); // i2 == 1.
int i3 = str2.size(); // i3 == 2, because two wchar_t characters are needed for the surrogate pair.
Upvotes: 4
Views: 6353
Reputation: 153478
Addendum to the other answers.
To unravel the different units used by i1, i2 and i3 in the question's update:
i1 value of 2 is a size in bytes (sizeof(wchar_t) on Windows).
i2 value of 1 is a size in wchar_t units, IOW 2 bytes.
i3 value of 2 is a size in wchar_t units, IOW 4 bytes.
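A minimal sketch of that unit conversion, reusing str2 from the question's update (the byte counts assume Windows, where sizeof(wchar_t) is 2):

std::wstring str2 (L"\U0002008A");
size_t units = str2.size();             // 2 wchar_t code units
size_t bytes = units * sizeof(wchar_t); // 4 bytes of character data, excluding the terminator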
Upvotes: 0
Reputation: 73443
sizeof(s2) returns the number of bytes required to store the pointer s2 (or any other pointer), which is 4 bytes on your system. It has nothing to do with the character(s) stored in the string pointed to by s2.
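To see the difference, applying sizeof to an array instead of a pointer does measure the stored characters. A sketch (the values assume Windows, where sizeof(wchar_t) is 2 and pointers are 4 bytes on a 32-bit build):

const wchar_t  a2[] = L"\U0002008A"; // array: the surrogate pair plus the terminating L'\0'
const wchar_t* p2 = a2;              // pointer to its first element
int j1 = sizeof(a2); // 6: three wchar_t of 2 bytes each
int j2 = sizeof(p2); // 4: just the pointer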
Upvotes: 10
Reputation: 596266
sizeof(wchar_t*) is the same as sizeof(void*), in other words the size of a pointer itself. That will always be 4 on a 32-bit system and 8 on a 64-bit system. You need to use wcslen() or lstrlenW() instead of sizeof():
const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character
int i1 = sizeof(wchar_t); // i1 == 2
int i2 = wcslen(s1); // i2 == 1
int i3 = wcslen(s2); // i3 == 2
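Note that wcslen() counts wchar_t code units, not Unicode code points. If you need the number of code points, a minimal sketch (assuming wchar_t holds UTF-16, as it does on Windows; the function name is just illustrative) is to skip the low surrogate of each pair:

size_t count_code_points(const wchar_t* s)
{
    size_t n = 0;
    for (; *s != L'\0'; ++s)
    {
        // Low (trailing) surrogates are the second half of a pair; don't count them separately.
        if (*s < 0xDC00 || *s > 0xDFFF)
            ++n;
    }
    return n;
}

int i4 = count_code_points(s2); // i4 == 1, even though wcslen(s2) == 2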
Upvotes: 5