Raj Felix

Reputation: 709

Performance: encoding UTF-8/16, handling char[] / char* / std::string / BSTR

Quick intro: the question is about UTF-8 vs. UTF-16.

*I tried my best to keep this as short and specific as possible; please bear with me.

I know there are gazillions of variations of this specific UTF-8/16 issue, not to mention the global encoding subject (ANSI vs. Unicode), which is where my questioning started. And I guess it's not *my* quest only, as it could serve many other (performance-motivated) beginners in C++.

Being more specific, to the point:

Given the following environment parameters:

*Let's say this is a constant.

Can I use UTF-8 (half the size of UTF-16) and "get away with it"?

...saving space and time

TL;DR: I have recently moved to C++. In the last few days I have tried to decide how to handle strings, one of the most expensive data types to process. I have followed almost every famous and less famous article on the encoding issue, yet the more I kept searching the more confused I became about compatibility, while trying to keep the application high-performance without crossing the boundaries of the *framework.

I have used the term framework although I am planning to do most of the I/O via native C++. Can I use UTF-8? Do I even want UTF-8? I know one thing!

Windows' 'blood type' is UTF-16, although I think low-level I/O and also HTTP use/default to/prefer/benefit from UTF-8,

but I am on Windows and still working with .NET.

What can I use to max out my app's performance when querying, manipulating, and saving to a database?

A point I have read in a less famous [article].

Upvotes: 3

Views: 650

Answers (2)

coffeeyesplease

Reputation: 1028

Please read this article

http://www.joelonsoftware.com/articles/Unicode.html

Please read it carefully.

Now, regarding performance: I very much doubt you'll see any difference. You choose your encoding based on what your program is supposed to do.

Is it supposed to communicate with other programs?

Are you storing information in a database that will be accessed by other people?

Performance and disk space are not your first priorities when deciding which encoding to use.

Upvotes: 0

paercebal

Reputation: 83374

A bit of research

This is a compilation of research I did to answer your problem:

Hebrew and Cyrillic in Unicode

According to Wikipedia, the Unicode Hebrew block extends from U+0590 to U+05FF and from U+FB1D to U+FB4F (I don't know the proportions): https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet

According to Wikipedia, again, Cyrillic can be found in the following blocks: U+0400–U+04FF, U+0500–U+052F, U+2DE0–U+2DFF, U+A640–U+A69F, U+1D2B, U+1D78, U+FE2E–U+FE2F: https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode

UTF-8 vs. UTF-16

UTF-16 can represent the following glyphs with two bytes: U+0000 to U+D7FF and U+E000 to U+FFFF, which means all the characters above will be represented with two bytes (a wchar_t on Windows).

To represent Hebrew and Cyrillic, UTF-8 will always need at least two bytes, and possibly three:

  • U+0000 - U+007F : 1 byte
  • U+0080 - U+07FF : 2 bytes
  • U+0800 - U+FFFF : 3 bytes

Windows

You said it yourself: Windows's DNA is UTF-16. No matter what delusional websites claim, the WinAPI won't change to UTF-8, because that makes no sense from Microsoft's viewpoint (breaking compatibility with previous Windows applications just to make Linux lovers happy? Seriously?).

When you develop under Windows, everything Unicode there will be optimized/designed for UTF-16.

Even the "char" API from the WinAPI is just a wrapper that converts your char strings into wchar_t strings before calling the UTF-16 API you should have been calling directly anyway.

Test!

As your problem seems to be mainly I/O, you should experiment to see if there is a meaningful difference between reading/writing/sending/receiving UTF-16 vs. UTF-8 with sample data.

Conclusion

From every fact above, I see either a neutral choice between UTF-8 and UTF-16 (the Hebrew and Cyrillic glyphs) (*), or a choice leading to UTF-16 (Windows).

So, my own conclusion, unless your tests show otherwise, would be to stick to UTF-16 on Windows.

(*) You could sample a few strings in all the languages you are using, and gather statistics on how often the most common characters occur.

Bonus?

Now, in your stead, I would avoid using wchar_t directly on Windows.

Instead, I would use the _T(), TCHAR and <tchar.h> macro/typedef/include machinery offered by Windows: with but a few macros defined (UNICODE and _UNICODE, if memory serves), as well as a few smart overloads, you can:

  • use wchar_t and UTF-16 on Windows
  • use char and UTF-8 on Linux

Which will make your code more portable should you switch to another OS.

Upvotes: 1
