Reputation: 40170
According to cppreference.com's doc on wchar_t:

wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode. A notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units). It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.

The Standard says in [basic.fundamental]/5:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
So, if I want to deal with Unicode characters, should I use wchar_t?

Equivalently, how do I know if a specific Unicode character is "supported" by wchar_t?
Upvotes: 24
Views: 14303
Reputation: 25418
It all depends on what you mean by 'deal with', but one thing is for sure: where Unicode is concerned, std::basic_string doesn't provide any real functionality at all.
In any particular program, you will need to perform X number of Unicode-aware operations, e.g. intelligent string matching, case folding, regex, locating word breaks, using a Unicode string as a path name maybe, and so on.
To support these operations there will almost always be some kind of library and/or native API provided by the platform, and the goal for me would be to store and manipulate my strings in such a way that these operations can be carried out without scattering knowledge of the underlying library and native API support throughout the code any more than necessary. I'd also want to future-proof myself as to the width of the characters I store in my strings, in case I change my mind.
Suppose, for example, you decide to use ICU to do the heavy lifting. Immediately there is an obvious problem: an icu::UnicodeString is not related in any way to std::basic_string. What to do? Work exclusively with icu::UnicodeString throughout the code? Probably not.
Or maybe the focus of the application switches from European languages to Asian ones, so that UTF-16 becomes (perhaps) a better choice than UTF-8.
So, my choice would be to use a custom string class derived from std::basic_string, something like this:
typedef wchar_t mychar_t; // say
class MyString : public std::basic_string <mychar_t>
{
...
};
Straightaway you have flexibility in choosing the size of the code units stored in your container. But you can do much more than that. For example, with the above declaration (and after you add in boilerplate for the various constructors that you need to provide to forward them to std::basic_string), you still cannot say:
MyString s = "abcde";
Because "abcde" is a narrow string and various the constructors for std::basic_string <wchar_t>
all expect a wide string. Microsoft solve this with a macro (TEXT ("...")
or __T ("...")
), but that is a pain. All we need to do now is to provide a suitable constructor in MyString
, with signature MyString (const char *s)
, and the problem is solved.
In practice, this constructor would probably expect a UTF-8 string, regardless of the underlying character width used for MyString, and convert it if necessary. Someone comments here somewhere that you should store your strings as UTF-8 so that you can construct them from UTF-8 literals in your code. Well, now we have broken that constraint. The underlying character width of our strings can be anything we like.
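A minimal sketch of that constructor, keeping the mychar_t typedef from above. The from_utf8 helper here is illustrative, built on std::wstring_convert (deprecated since C++17); a real implementation might delegate to ICU or a native API instead:

#include <codecvt>
#include <locale>
#include <string>

typedef wchar_t mychar_t; // as above

// Illustrative UTF-8 -> mychar_t conversion helper.
static std::basic_string<mychar_t> from_utf8(const char *s)
{
    std::wstring_convert<std::codecvt_utf8<mychar_t>, mychar_t> conv;
    return conv.from_bytes(s);
}

class MyString : public std::basic_string<mychar_t>
{
    using base = std::basic_string<mychar_t>;
public:
    using base::base; // forward the base-class constructors

    // The constructor discussed above: accept a narrow UTF-8 string,
    // whatever the underlying character width of MyString may be.
    MyString(const char *s) : base(from_utf8(s)) {}
};

With that in place, MyString s = "abcde"; compiles with no TEXT()/__T() macros in sight.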
Another thing that people have been talking about in this thread is that find_first_of may not work properly for UTF-8 strings (and indeed some UTF-16 ones also). Well, now you can provide an implementation that does the job properly. It should take about half an hour. If there are other 'broken' implementations in std::basic_string (and I'm sure there are), then most of them can probably be replaced with similar ease.
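To illustrate, a code-point-aware find_first_of for UTF-8 data could look roughly like the sketch below. The decoder is deliberately bare-bones: it assumes well-formed UTF-8 and does no validation, and a production version would also dispatch on the width of mychar_t:

#include <cstddef>
#include <string>
#include <vector>

// Decode one UTF-8 code point starting at s[i]; advance i past it.
// Assumes well-formed UTF-8 and performs no validation.
static char32_t decode_utf8(const std::string &s, std::size_t &i)
{
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    char32_t cp = c & (0x3F >> extra);       // bits from the lead byte
    while (extra-- > 0)                      // then 6 bits per continuation byte
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Returns the byte offset of the first code point in 'text' that also
// occurs in 'set', or std::string::npos if there is none.
std::size_t find_first_of_utf8(const std::string &text, const std::string &set)
{
    std::vector<char32_t> wanted;
    for (std::size_t i = 0; i < set.size(); )
        wanted.push_back(decode_utf8(set, i));

    for (std::size_t i = 0; i < text.size(); ) {
        std::size_t start = i;
        char32_t cp = decode_utf8(text, i);
        for (char32_t w : wanted)
            if (w == cp)
                return start;
    }
    return std::string::npos;
}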
As for the rest, it mainly depends on what level of abstraction you want to implement in your MyString class. If your application is happy to have a dependency on ICU, for example, then you can just provide a couple of methods to convert to and from an icu::UnicodeString. That's probably what most people would do.
Or if you need to pass UTF-16 strings to and from native Windows APIs, then you can add methods to convert to and from const WCHAR * (which, again, you would implement in such a way that they work for all values of mychar_t). Or you could go further and abstract away some or all of the Unicode support provided by the platform and library you are using. The Mac, for example, has rich Unicode support, but it's only available from Objective-C, so you have to wrap it.
It depends on how portable you want your code to be.
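For example, the ICU glue could be as thin as the following. icu::UnicodeString::fromUTF8 and toUTF8String are real ICU4C calls, but the free-function names are made up, and for simplicity the strings are exchanged as UTF-8:

#include <unicode/unistr.h> // ICU4C
#include <string>

// Hypothetical conversion helpers of the kind described above.
icu::UnicodeString to_icu(const std::string &utf8)
{
    return icu::UnicodeString::fromUTF8(utf8);
}

std::string from_icu(const icu::UnicodeString &us)
{
    std::string utf8;
    us.toUTF8String(utf8); // appends the UTF-8 form to utf8
    return utf8;
}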
So you can add in whatever functionality you like, probably on an ongoing basis as work progresses, without losing the ability to carry your strings around as a std::basic_string of one sort or another. Just try not to write code that assumes it knows how wide a character is, or that a string contains no surrogate pairs.
Upvotes: 4
Reputation: 12708
First of all, you should check (as you point out in your question) whether you are using Windows and Visual Studio C++, with wchar_t being 16 bits, because in that case, to get full Unicode support, you'll need to assume UTF-16 encoding.
The basic problem here is not the size of the wchar_t you are using, but whether the libraries you are going to use provide full Unicode support.
Java has a similar problem, as its char type is 16 bits wide, so a priori it couldn't support the full Unicode space; but it does, as it uses UTF-16 encoding and surrogate pairs to cope with the full 21-bit range of code points.

It's also worth noting that Unicode uses the supplementary planes only to encode rarer code points, which are not normally used daily.
For Unicode support anyway, you need to use wide character sets, so wchar_t is a good beginning. If you are going to work with Visual Studio, then you have to check how its libraries deal with Unicode characters.
Another thing to note is that the standard libraries deal with character sets (and this includes Unicode) only when you add locale support (this requires the library to be initialized, e.g. with setlocale(3)), and so you'll see no Unicode at all (only basic ASCII) in cases where you have not called setlocale(3).
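A minimal illustration; the empty locale name means "use the user's environment settings":

#include <clocale>
#include <cstdio>
#include <cwchar>

int main()
{
    // Without this call the runtime stays in the default "C" locale and
    // wide output of non-ASCII characters will typically fail.
    std::setlocale(LC_ALL, "");

    std::fwprintf(stdout, L"%ls\n", L"héllo wörld");
}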
There are wide char functions for almost any str*(3) function, as well as for any stdio.h library function, to deal with wchar_ts. A little dig into the /usr/include/wchar.h file will reveal the names of the routines. Go to the manual pages for documentation on them: fgetws(3), fputwc(3), fputws(3), fwide(3), fwprintf(3), ...
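The naming pattern is regular: str* becomes wcs*. For example:

#include <cstring>
#include <cwchar>

void demo()
{
    const char    *n = "hello";
    const wchar_t *w = L"hello";

    std::size_t nlen = std::strlen(n); // narrow: strlen, strcpy, strcmp, ...
    std::size_t wlen = std::wcslen(w); // wide:   wcslen, wcscpy, wcscmp, ...
    (void)nlen; (void)wlen;            // silence unused-variable warnings
}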
Finally, consider again that, if you are dealing with Microsoft Visual C++, you have a different implementation from the beginning. Even if it aims to be completely standard compliant, you'll have to cope with some idiosyncrasies of having a different implementation. Probably you'll have different function names for some uses.
Upvotes: 3
Reputation: 31669
wchar_t is used on Windows, which uses the UTF-16LE format. wchar_t requires wide char functions, for example wcslen(const wchar_t*) instead of strlen(const char*), and std::wstring instead of std::string.

Unix-based machines (Linux, Mac, etc.) use UTF-8. This uses char for storage, and the same C and C++ functions as for ASCII, such as strlen(const char*) and std::string (see comments below about std::find_first_of).
wchar_t is 2 bytes (UTF-16) on Windows, but on other machines it is 4 bytes (UTF-32). This makes things more confusing.
For UTF-32, you can use std::u32string, which is the same on different systems.
You might consider converting UTF-8 to UTF-32, because that way each character is always 4 bytes, and you might think string operations will be easier. But that's rarely necessary.

UTF-8 is designed so that the ASCII byte values 0 through 127 are not used to represent parts of other Unicode code points. That includes the escape character '\', printf format specifiers, and common parsing characters like ','.

Consider the following UTF-8 string. Let's say you want to find the comma:
std::string str = u8"汉,🙂"; //3 code points represented by 8 bytes
The ASCII value for comma is 44, and str is guaranteed to contain only one byte whose value is 44. To find the comma, you can simply use any standard function in C or C++ to look for ','.
To find 汉, you can search for the string u8"汉", since this code point cannot be represented as a single character.
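Both searches in one fragment. Note that since C++20 a u8 literal yields char8_t; this assumes the pre-C++20 behaviour, where u8"..." is an ordinary char array:

#include <string>

std::string str = u8"汉,🙂"; // 3 code points represented by 8 bytes

std::size_t comma = str.find(',');    // 3: 汉 occupies bytes 0..2
std::size_t han   = str.find(u8"汉"); // 0: matched as a 3-byte substring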
Some C and C++ functions don't work smoothly with UTF-8. These include:
strtok
strspn
std::find_first_of
The argument to the above functions is a set of characters, not an actual string.
So str.find_first_of(u8"汉") does not work, because u8"汉" is 3 bytes and find_first_of will look for any one of those bytes. There is a chance that one of those bytes is used to represent a different code point.
On the other hand, str.find_first_of(u8",;abcd") is safe, because all the characters in the search argument are ASCII (str itself can contain any Unicode character).
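Continuing with str from the earlier snippet, a sketch of the difference:

// Unsafe: u8"汉" is treated as a set of 3 separate bytes, any of which
// may also appear inside an unrelated multi-byte sequence.
auto bad = str.find_first_of(u8"汉");

// Safe: every character in the search set is ASCII, and UTF-8 never
// reuses ASCII byte values inside multi-byte sequences.
auto ok = str.find_first_of(u8",;abcd"); // 3: finds the comma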
In rare cases UTF-32 might be required (although I can't imagine where!). You can use std::codecvt to convert UTF-8 to UTF-32 and run operations like the following:
std::u32string u32 = U"012汉"; // 4 code points, represented by 4 elements
std::cout << u32.find_first_of(U"汉") << std::endl; // outputs 3
std::cout << u32.find_first_of(U'汉') << std::endl; // outputs 3
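A minimal sketch of the conversion step itself, using std::wstring_convert (deprecated since C++17 but still available):

#include <codecvt>
#include <locale>
#include <string>

// Convert UTF-8 bytes into one char32_t element per code point.
std::u32string utf8_to_utf32(const std::string &utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}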
Side note:
You should use "Unicode everywhere", not "UTF-8 everywhere".

On Linux, Mac, etc., use UTF-8 for Unicode.

On Windows, use UTF-16 for Unicode. Windows programmers use UTF-16; they don't make pointless conversions back and forth to UTF-8. But there are legitimate cases for using UTF-8 on Windows.

Windows programmers tend to use UTF-8 for saving files, web pages, etc., so that's less worry for non-Windows programmers in terms of compatibility.

The language itself doesn't care which Unicode format you want to use, but in terms of practicality, use a format that matches the system you are working on.
Upvotes: 11
Reputation: 7601
So, if I want to deal with Unicode characters, should I use wchar_t?

First of all, note that the encoding does not force you to use any particular type to represent a certain character. You may use char to represent Unicode characters just as wchar_t can; you only have to remember that, depending on the encoding (UTF-8, UTF-16, or UTF-32), up to 4 chars together will form one valid code point, while wchar_t can use 1 (UTF-32 on Linux, etc.) or up to 2 working together (UTF-16 on Windows).
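A quick way to see those sizes on your own platform:

#include <cstdio>

int main()
{
    // wchar_t is the only one whose size varies: 2 bytes on Windows,
    // typically 4 on Linux and macOS.
    std::printf("char=%zu wchar_t=%zu char16_t=%zu char32_t=%zu\n",
                sizeof(char), sizeof(wchar_t), sizeof(char16_t), sizeof(char32_t));
}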
Next, there is no single definitive Unicode encoding. Some Unicode encodings use a fixed width for representing code points (like UTF-32); others (such as UTF-8 and UTF-16) have variable lengths (the letter 'a', for instance, will surely use up just 1 byte, but characters outside the English alphabet will use up more bytes).
So you have to decide which kind of characters you want to represent and then choose your encoding accordingly, because this will affect the number of bytes your data takes. E.g. using UTF-32 to represent mostly English characters will lead to many 0-bytes. UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually a better choice for East Asian languages.
Once you have decided on this, you should minimize the number of conversions and stay consistent with your decision.
In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).
If you would like to do text manipulation/interpretation on a code-point basis, char certainly is not the way to go if you have, e.g., Japanese kanji. But if you just want to communicate your data and regard it as no more than a quantitative sequence of bytes, you may just go with char.
The link to UTF-8 Everywhere was already posted as a comment, and I suggest having a look there as well. Another good read is What every programmer should know about encodings.
As of now, there is only rudimentary language support in C++ for Unicode (like the char16_t and char32_t data types, and the u8/u/U literal prefixes). So choosing a library for managing encodings (especially conversions) is certainly good advice.
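For reference, the built-in types and prefixes in one fragment (note that since C++20 a u8 literal yields char8_t rather than char):

#include <string>

auto u8s = u8"π"; // UTF-8 (const char* before C++20, const char8_t* after)
auto u16 = u"π";  // const char16_t*: UTF-16 code units
auto u32 = U"π";  // const char32_t*: UTF-32 code points

std::u16string s16 = u"π"; // one char16_t element (π lies in the BMP)
std::u32string s32 = U"π"; // one char32_t element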
Upvotes: 15
Reputation: 143
So, if I want to deal with Unicode characters, should I use wchar_t?
That depends on the encoding you're dealing with. In case of UTF-8, you're just fine with char and std::string. UTF-8 means the least encoding unit is 8 bits: all Unicode code points from U+0000 to U+007F are encoded by only 1 byte. Beginning with code point U+0080, UTF-8 uses 2 bytes for encoding; starting from U+0800 it uses 3 bytes, and from U+10000, 4 bytes. To handle this variable width (1 byte / 2 bytes / 3 bytes / 4 bytes), char fits best. Be aware that C functions like strlen will provide byte-based results: "öö" is in fact a 2-character text, but strlen will return 4, because 'ö' is encoded as the two bytes 0xC3 0xB6.
UTF-16 means the least encoding unit is 16 bits: all code points from U+0000 to U+FFFF are encoded by 2 bytes; starting from U+10000, 4 bytes are used. In case of UTF-16, you should use wchar_t and std::wstring, because most of the characters you'll ever encounter will be 2-byte encoded. When using wchar_t, you can't use C functions like strlen any more; you have to use the wide char equivalents like wcslen.
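A small demonstration of both counting behaviours; the explicit escapes avoid any dependence on the source file's encoding:

#include <cstdio>
#include <cstring>
#include <cwchar>

int main()
{
    const char *utf8 = "\xC3\xB6\xC3\xB6";   // "öö" in UTF-8
    std::printf("%zu\n", std::strlen(utf8)); // 4: counts bytes, not characters

    const wchar_t *wide = L"\u00F6\u00F6";   // "öö" as wide characters
    std::printf("%zu\n", std::wcslen(wide)); // 2: one wchar_t per ö
}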
When using Visual Studio and building with the "Unicode" configuration, you'll get UTF-16: TCHAR and CString will be based on wchar_t instead of char.
Upvotes: 5