iammilind

Reputation: 70030

Difference between std::string and std::u16string (or u32string)

I read the posts below before asking here:

std::string, wstring, u16/32string clarification
std::u16string, std::u32string, std::string, length(), size(), codepoints and characters

But they don't answer my question. Look at the simple code below:

#include<iostream>
#include<string>
using namespace std;

int main ()
{
  char16_t x[] = { 'a', 'b', 'c', 0 };
  u16string arr = x;

  cout << "arr.length = " << arr.length() << endl;
  for(auto i : arr)
    cout << i << "\n";
}

The output is:

arr.length = 3  // a + b + c
97
98
99

Given that std::u16string consists of char16_t and not char, shouldn't the output be:

arr.length = 2  // ab + c(\0)
<combining 'a' and 'b'>
99

Please excuse the novice question. I want to get a clear understanding of the new C++11 string types.

Edit:

From @Jonathan's answer, I see the flaw in my question. What I actually want to know is how to initialize the char16_t array so that the length of arr becomes 2 (i.e. ab, c\0).
FYI, the code below gives a different result:

  char x[] = { 'a', 'b', 'c', 0 };
  u16string arr = (char16_t*)x;  // probably undefined behavior

Output:

arr.length = 3
25185
99
32767

Upvotes: 4

Views: 11133

Answers (4)

Galik

Reputation: 48635

When you do:

char16_t x[] = { 'a', 'b', 'c', 0 };

It is similar to doing this (endianness notwithstanding):

char x[] = { '\0', 'a', '\0', 'b', '\0', 'c', '\0', '\0' };

Each character occupies two bytes in memory.

So when you ask for the length of a u16string, every two bytes are counted as one character. They are, after all, two-byte (16-bit) characters.

EDIT:

The code in your edit builds the string from a buffer that has no 16-bit null terminator.

Try this:

char x[] = { 'a', 'b', 'c', 0 , 0, 0};
u16string arr = (char16_t*)x;

Now the first character is {'a', 'b'}, the second character is {'c', 0}, and there is also a null terminator character {0, 0}.

Upvotes: 3

R. Martinho Fernandes

Reputation: 234584

C++ supports the following way to build 16-bit integers from 8-bit integers:

char16_t ab = (static_cast<unsigned char>('a') << 8) | 'b';
// (Note: cast to unsigned meant to prevent overflows)

Upvotes: -1

Solkar

Reputation: 1226

shouldn't the output be:

arr.length = 2  // ab + c(\0)
99

No. The elements of x are char16_t, regardless of the fact that you initialize them with char literals:

#include<iostream>

int main () {
    char16_t x[] = { 'a', 'b', 'c', 0 };
    std::cout << sizeof(x[0]) << std::endl;
}

output:

2 


Addendum, referring to the EDIT of the question

I'd not exactly recommend casting the termination away from strings. ;)

#include<iostream>
#include<string>

int main () {
    char x[] = { 'a', 'b', 'c', 0, 0, 0, 0, 0};

    std::wstring   ws   = reinterpret_cast<wchar_t*>(x);
    std::u16string u16s = reinterpret_cast<char16_t*>(x);

    std::cout << "sizeof(wchar_t):  "       << sizeof(wchar_t)
              << "\twide string length: "   << ws.length()   
              << std::endl;

    std::cout << "sizeof(char16_t): "       << sizeof(char16_t)
              << "\tu16string length:  "    << u16s.length()
              << std::endl;
}


output (compiled with g++)

sizeof(wchar_t):  4 wide string length: 1
sizeof(char16_t): 2 u16string length:   2

As expected, isn't it?

Upvotes: 1

Jonathan Wakely

Reputation: 171383

No, you have created an array of four elements: the first element is 'a' converted to char16_t, the second is 'b' converted to char16_t, and so on.

Then you create a u16string from that array (converted to a pointer), which reads each element up to the null terminator.

Upvotes: 4
